• Context:
A telecom company wants to use their historical customer data to predict behaviour to retain customers. You can analyse all relevant customer data and develop focused customer retention programs.
• Data Description:
Each row represents a customer, each column contains customer’s attributes described on the column Metadata. The data set includes information about:
• Customers who left within the last month – the column is called Churn
• Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
• Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
• Demographic info about customers – gender, age range, and if they have partners and dependents
• Project Objective:
Build a model that helps identify customers who have a higher probability of churning. This helps the company understand the pain points and patterns of customer churn and sharpens the focus on customer retention strategy.
• Steps to the project: [ Total score: 60 points ]
Customer churn, or customer turnover, refers to a customer ceasing to use a company's services. Churn prediction is a class of problem that extends to many areas, such as employees leaving a company or customers cancelling a mobile subscription. We are going to use the telecom data to predict churn. After loading the data, we will explore the attributes and the relationships between them before building our model.
Customer attrition, also known as customer churn, customer turnover, or customer defection, is the loss of clients or customers.
Telephone service companies, Internet service providers, pay TV companies, insurance firms, and alarm monitoring services, often use customer attrition analysis and customer attrition rates as one of their key business metrics because the cost of retaining an existing customer is far less than acquiring a new one. Companies from these sectors often have customer service branches which attempt to win back defecting clients, because recovered long-term customers can be worth much more to a company than newly recruited clients.
Companies usually make a distinction between voluntary churn and involuntary churn. Voluntary churn occurs due to a decision by the customer to switch to another company or service provider; involuntary churn occurs due to circumstances such as a customer's relocation to a long-term care facility, death, or relocation to a distant location. In most applications, involuntary reasons for churn are excluded from the analytical models. Analysts tend to concentrate on voluntary churn, because it typically occurs due to factors of the company-customer relationship which companies control, such as how billing interactions are handled or how after-sales help is provided.
Predictive analytics uses churn prediction models that predict customer churn by assessing each customer's propensity to churn. Since these models generate a small prioritized list of potential defectors, they are effective at focusing customer retention marketing programs on the subset of the customer base most vulnerable to churn.
1. Volume:
Observation:
1) The data set has 7043 rows × 21 columns, a total of 147,903 entries
2) Out of 21 columns, 17 have only 2-4 unique values, predominantly binary values such as "Yes", "No", "Male", "Female"
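The volume figures above can be read straight off a DataFrame. The following is a minimal sketch on a small stand-in frame (the `toy` data and the cutoff of 4 unique values are assumptions for illustration; in the notebook the same calls would be applied to the loaded data):

```python
import pandas as pd

# Stand-in data; not the real customer file.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male", "Male", "Female"],
    "tenure": [1, 34, 2, 45, 3],
    "Churn": ["No", "No", "Yes", "No", "Yes"],
})

n_rows, n_cols = toy.shape   # (rows, columns)
n_cells = toy.size           # rows * columns

# Columns with at most 4 distinct values (mostly binary or small categorical).
low_cardinality = [c for c in toy.columns if toy[c].nunique() <= 4]

print(n_rows, n_cols, n_cells, low_cardinality)
```

Note that `shape` returns rows first, then columns.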
Recommendation:
1) Since the dataset is all about Telecom Churn additional features like the following attributing to the Churn would have added more value
A) Reasons for Churn
B) Customer Segmentation [Consumer, MSME, SME, Corporate, Key Account]
C) Geographic Location to show Marketing segmentation [High ARPU, Low ARPU]
D) Outstanding collections/receivables from the customer [0-30, 30-60, 60-90, >90 days]
E) Citing poor or no network coverage
2. Velocity:
Observation:
1) The dataset doesn't indicate the duration and the frequency of the data collected
2) Customer churn is the key feature in the dataset, but it is not evident how frequently the data is updated when a customer churns
Recommendation:
1) To improve Velocity the dataset shall be refreshed when a customer churns
3. Variety:
Observation:
The dataset provides a variety of features of a customer with which we shall interpret how it relates to a churn like
1) 90% of customers had a phone connection
2) 45% of customers had multiple phone lines
3) 70% of customers had a fibre optic connection
4) 78% of customers didn't have online security
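Shares like the ones listed above can be checked against the data with `value_counts(normalize=True)`. This is a small sketch on made-up rows; the helper name `share_percent` and the `toy` frame are illustrative assumptions, not part of the notebook:

```python
import pandas as pd

# Stand-in data; the real notebook would pass the loaded DataFrame instead.
toy = pd.DataFrame({
    "PhoneService": ["Yes", "Yes", "Yes", "No"],
    "InternetService": ["DSL", "Fiber optic", "Fiber optic", "No"],
})

def share_percent(frame, column):
    """Return the percentage of rows taken by each value of `column`."""
    return frame[column].value_counts(normalize=True).mul(100).round(1)

phone_share = share_percent(toy, "PhoneService")
print(phone_share)
```

Running the same helper on each service column of the real data is a quick way to confirm or correct the quoted percentages.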
Recommendation:
To improve the consistency of the dataset, the following additions should be considered
1) What led the customer to churn (reasons)
2) How many features cumulatively led to churn
4. Variability:
Observation:
The dataset has 21 columns, of which two features (SeniorCitizen and tenure) have data type int64 and one (MonthlyCharges) has data type float64; the rest, including TotalCharges, are read as object
Recommendation:
To add more variability to the data set, key features related to churn, such as CSAT survey results and trouble ticket history, would be valuable
5. Veracity:
Observation:
In our dataset, MonthlyCharges is float64 and tenure is int64, while TotalCharges is read as object and must be cast to float64 before use. The accuracy of this information will help to assess the impact of churn
Recommendation:
To improve the accuracy of the data source, tenure, monthly charges and total charges can be segmented into bands, for example tenure bands of 0-6 months and 6 months to 1 year, and charge bands of 0-500, 500-1000, etc.
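One way to build such bands is `pandas.cut`. The sketch below uses hypothetical band edges and labels chosen only for illustration; the real cut points would need to be agreed with the business:

```python
import pandas as pd

tenure = pd.Series([1, 6, 7, 12, 13, 30], name="tenure")

# Hypothetical band edges in months. With right=True each bin is (low, high],
# so a tenure of exactly 6 falls in "0-6" and 7 falls in "6-12".
edges = [0, 6, 12, 24, 72]
labels = ["0-6", "6-12", "12-24", "24-72"]

tenure_band = pd.cut(tenure, bins=edges, labels=labels, right=True)
print(tenure_band.tolist())
```

With these edges a tenure of 0 would fall outside every bin and become missing; passing `include_lowest=True` keeps it in the first band.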
6. Visualization:
Observation:
For the given data set the volume isn't huge hence Box Plot, Pair Plot, Histogram and Pie Charts are quite sufficient to represent the data set
Recommendation:
If the data set is huge, we can look at using spectrograms, scatter plots, geo charts, etc.
7. Value:
Observation:
The information provided in the data set can be categorized as nice to have, and some essential information was lacking
Recommendation:
To predict churn, other vital information like the following would have added more value and led to a more data-driven prediction
A) Reasons for Churn
B) Customer Segmentation [Consumer, MSME, SME, Corporate, Key Account]
C) Geographic Location to show Marketing segmentation [High ARPU, Low ARPU]
D) Outstanding collections/receivables from the customer [0-30, 30-60, 60-90, >90 days]
E) Citing poor or no network coverage
# Added increase screen size
from IPython.display import display, HTML
display(HTML(data="""
<style>
div#notebook-container { width: 95%; }
div#menubar-container { width: 65%; }
div#maintoolbar-container { width: 99%; }
</style>
"""))
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt#visualization
from PIL import Image
%matplotlib inline
import pandas as pd
import seaborn as sns#visualization
import itertools
import io
import plotly.offline as py#visualization
py.init_notebook_mode(connected=True)#visualization
from plotly.subplots import make_subplots
import plotly.graph_objs as go#visualization
import plotly.tools as tls#visualization
import plotly.figure_factory as ff#
import warnings
from sklearn import metrics
from sklearn.ensemble import BaggingClassifier
import plotly.io as pio
import time  # needed by timer() below
from contextlib import contextmanager
@contextmanager
def timer(title):
    """Print how long the enclosed block took to run."""
    t0 = time.time()
    yield
    print("{} - done in {:.0f}s".format(title, time.time() - t0))
warnings.filterwarnings('ignore') #ignore warning messages
%matplotlib inline
pio.renderers.default='notebook'
telcom = pd.read_csv("TelcomCustomer-Churn.csv")
telcom.head()
| | customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
The dataset has 21 attributes and below is the definition:
In this section, we will first do an exploratory data analysis, exploring most attributes and checking how they relate to customer churn. We will follow the steps below:
Before computing any statistics, we will take a look at the data types.
telcom.shape
telcom.drop(labels=['customerID'], axis=1, inplace=True)
customerID column
Since the customerID column does not provide any relevant information for predicting customer churn, we can drop it.
print ("Rows : " ,telcom.shape[0])
print ("Columns : " ,telcom.shape[1])
print ("\nFeatures : \n" ,telcom.columns.tolist())
print ("\nMissing values : ", telcom.isnull().sum().values.sum())
print ("\nUnique values : \n",telcom.nunique())
Rows : 7043
Columns : 20

Features :
['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']

Missing values : 0

Unique values :
gender 2
SeniorCitizen 2
Partner 2
Dependents 2
tenure 73
PhoneService 2
MultipleLines 3
InternetService 3
OnlineSecurity 3
OnlineBackup 3
DeviceProtection 3
TechSupport 3
StreamingTV 3
StreamingMovies 3
Contract 3
PaperlessBilling 2
PaymentMethod 4
MonthlyCharges 1585
TotalCharges 6531
Churn 2
dtype: int64
telcom.describe()
| | SeniorCitizen | tenure | MonthlyCharges |
|---|---|---|---|
| count | 7043.000000 | 7043.000000 | 7043.000000 |
| mean | 0.162147 | 32.371149 | 64.761692 |
| std | 0.368612 | 24.559481 | 30.090047 |
| min | 0.000000 | 0.000000 | 18.250000 |
| 25% | 0.000000 | 9.000000 | 35.500000 |
| 50% | 0.000000 | 29.000000 | 70.350000 |
| 75% | 0.000000 | 55.000000 | 89.850000 |
| max | 1.000000 | 72.000000 | 118.750000 |
telcom.isnull().sum()
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
dtype: int64
telcom.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 7043 entries, 0 to 7042 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 gender 7043 non-null object 1 SeniorCitizen 7043 non-null int64 2 Partner 7043 non-null object 3 Dependents 7043 non-null object 4 tenure 7043 non-null int64 5 PhoneService 7043 non-null object 6 MultipleLines 7043 non-null object 7 InternetService 7043 non-null object 8 OnlineSecurity 7043 non-null object 9 OnlineBackup 7043 non-null object 10 DeviceProtection 7043 non-null object 11 TechSupport 7043 non-null object 12 StreamingTV 7043 non-null object 13 StreamingMovies 7043 non-null object 14 Contract 7043 non-null object 15 PaperlessBilling 7043 non-null object 16 PaymentMethod 7043 non-null object 17 MonthlyCharges 7043 non-null float64 18 TotalCharges 7043 non-null object 19 Churn 7043 non-null object dtypes: float64(1), int64(2), object(17) memory usage: 1.1+ MB
As of now we don't see any null values. However, we will find a few in the TotalCharges column after casting it to float64
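The reason `isnull()` reports nothing is that blank strings are not nulls. A minimal sketch of how the hidden gaps surface on conversion, using made-up values and `pandas.to_numeric` with `errors="coerce"` (an alternative to the replace-then-cast approach used in this notebook):

```python
import pandas as pd

# Stand-in values; a single space plays the role of a blank TotalCharges entry.
raw = pd.Series(["29.85", " ", "1889.5", " "], name="TotalCharges")

print(raw.isnull().sum())          # blanks are strings, so nothing counts as null yet

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
as_float = pd.to_numeric(raw, errors="coerce")
n_missing = int(as_float.isna().sum())

print(as_float.dtype, n_missing)
```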
#Data Manipulation
#Replacing spaces with null values in total charges column
telcom['TotalCharges'] = telcom["TotalCharges"].replace(" ",np.nan)
#Dropping null values from total charges column which contain .15% missing data
telcom = telcom[telcom["TotalCharges"].notnull()]
telcom = telcom.reset_index()[telcom.columns]
#convert to float type
telcom["TotalCharges"] = telcom["TotalCharges"].astype(float)
#replace 'No internet service' to No for the following columns
replace_cols = [ 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'TechSupport','StreamingTV', 'StreamingMovies']
for i in replace_cols :
    telcom[i] = telcom[i].replace({'No internet service' : 'No'})
#replace values
telcom["SeniorCitizen"] = telcom["SeniorCitizen"].replace({1:"Yes",0:"No"})
#Tenure to categorical column
def tenure_lab(row) :
    # Map a row's tenure (in months) to a tenure band label
    if row["tenure"] <= 12 :
        return "Tenure_0-12"
    elif (row["tenure"] > 12) and (row["tenure"] <= 24) :
        return "Tenure_12-24"
    elif (row["tenure"] > 24) and (row["tenure"] <= 48) :
        return "Tenure_24-48"
    elif (row["tenure"] > 48) and (row["tenure"] <= 60) :
        return "Tenure_48-60"
    elif row["tenure"] > 60 :
        return "Tenure_gt_60"
telcom["tenure_group"] = telcom.apply(tenure_lab, axis = 1)
#Separating churn and non churn customers
churn = telcom[telcom["Churn"] == "Yes"]
not_churn = telcom[telcom["Churn"] == "No"]
#Separating catagorical and numerical columns
Id_col = ['customerID']
target_col = ["Churn"]
cat_cols = telcom.nunique()[telcom.nunique() < 6].keys().tolist()
cat_cols = [x for x in cat_cols if x not in target_col]
num_cols = [x for x in telcom.columns if x not in cat_cols + target_col + Id_col]
It can also be noted that the Tenure column is 0 for these entries even though the MonthlyCharges column is not empty. Let's see if there are any other 0 values in the Tenure column.
telcom.head()
| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | ... | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | tenure_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | No | Yes | No | 1 | No | No phone service | DSL | No | Yes | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | Tenure_0-12 |
| 1 | Male | No | No | No | 34 | Yes | No | DSL | Yes | No | ... | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No | Tenure_24-48 |
| 2 | Male | No | No | No | 2 | Yes | No | DSL | Yes | Yes | ... | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | Tenure_0-12 |
| 3 | Male | No | No | No | 45 | No | No phone service | DSL | Yes | No | ... | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No | Tenure_24-48 |
| 4 | Female | No | No | No | 2 | Yes | No | Fiber optic | No | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | Tenure_0-12 |
5 rows × 21 columns
There are no other zero values in the tenure column. The rows with a blank TotalCharges were already dropped above, and the minimum tenure in the remaining data is 1.
#Splitting the dataset into features and target
target_raw=telcom['Churn']
features=telcom.drop('Churn',axis=1)
data_cat = ['gender', 'SeniorCitizen', 'Partner', 'Dependents',
'PhoneService', 'MultipleLines', 'InternetService',
'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
'TechSupport','StreamingTV','StreamingMovies',
'Contract', 'PaperlessBilling','PaymentMethod']
fig , ax = plt.subplots(4,4,figsize=(15,15))
sns.set(style="ticks", color_codes=True)
for axis,col in zip(ax.flat,data_cat):
    sns.countplot(x=telcom["Churn"],hue=telcom[col],ax=axis)
Plot insights:
The churn rate for senior citizens is much higher than for non-seniors.
The churn rate for month-to-month contracts is much higher than for other contract durations.
Moderately higher churn rate for customers without partners.
Much higher churn rate for customers without dependents.
The electronic check payment method shows a much higher churn rate than other payment methods.
Customers with fiber optic InternetService as part of their contract have a much higher churn rate.
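Count plots show absolute numbers, so rate statements like the ones above are easier to confirm with a normalized cross-tabulation. A minimal sketch on made-up rows (the `toy` frame is an assumption; the column names mirror the dataset):

```python
import pandas as pd

# Stand-in data with the same column names as the dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month"] * 4 + ["Two year"] * 4,
    "Churn": ["Yes", "Yes", "No", "No", "No", "No", "No", "Yes"],
})

# normalize="index" makes each row sum to 1, so the "Yes" column
# is the churn rate within each contract type.
churn_rate = pd.crosstab(toy["Contract"], toy["Churn"], normalize="index")["Yes"].mul(100)
print(churn_rate)
```

The same call with any other categorical column in place of `Contract` gives the churn rate per category of that column.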
#labels
lab = telcom["Churn"].value_counts().keys().tolist()
#values
val = telcom["Churn"].value_counts().values.tolist()
trace = go.Pie(labels = lab ,
values = val ,
marker = dict(colors = [ 'royalblue' ,'lime'],
line = dict(color = "white",
width = 1.3)
),
rotation = 90,
hoverinfo = "label+value+text",
hole = .5
)
layout = go.Layout(dict(title = "Customer attrition in data",
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
)
)
data = [trace]
fig = go.Figure(data = data,layout = layout)
py.iplot(fig)
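# TotalCharges has no missing values left after the blank rows were dropped above,
# so this fill is a no-op kept as a safeguard
meanTotalCharge = telcom.TotalCharges.mean()
telcom['TotalCharges']=telcom['TotalCharges'].fillna(meanTotalCharge)
# numeric_only=True restricts the mean to numeric columns (string columns raise an error in pandas 2.x)
telcom.groupby('Churn').mean(numeric_only=True)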
| Churn | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|
| No | 37.650010 | 61.307408 | 2555.344141 |
| Yes | 17.979133 | 74.441332 | 1531.796094 |
As we can see, customers who churn have on average a shorter tenure (about 18 months versus about 38) and higher monthly charges (about 74 versus about 61) than customers who do not churn. Their total charges are lower (about 1532 versus about 2555).
f,axes = plt.subplots(ncols=3, figsize=(17,6))
sns.distplot(telcom.tenure,kde=True,ax=axes[0], color='darkorange').set_title("Customer tenure")
axes[0].set_ylabel('No of Customers')
sns.distplot(telcom.MonthlyCharges,kde=True,ax=axes[1],color='maroon').set_title('Monthly Charges')
axes[1].set_ylabel('No of Customers')
sns.distplot(telcom.TotalCharges,kde=True,ax=axes[2]).set_title('Total Charges')
axes[2].set_ylabel('No of Customers')
From the observations above, it looks like:
plt.figure(figsize=(8,4))
p=sns.countplot(x="gender", hue="Churn", data=telcom)
ax=plt.gca()
# Annotate each bar with its share of all customers (not the churn rate within the group)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,height+2, '{:.2f}%'.format(100*(height/telcom.shape[0])),fontsize=12,ha='center',va='bottom')
sns.set(font_scale=1.5)
plt.title('Churn Distribution by gender', fontweight="bold");
There are more male customers than female customers, but both sexes seem to churn at about the same rate.
plt.figure(figsize=(10,5))
p=sns.countplot(x='Contract',hue='Churn',data=telcom)
ax=plt.gca()
# Annotate each bar with its share of all customers (not the churn rate within the group)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,height+2,'{:.2f}%'.format(100*(height/telcom.shape[0])),fontsize=14,ha='center',va='bottom')
sns.set(font_scale=1.5)
plt.title('Churn Distribution by Contract', fontweight='bold')
Most customers are on month-to-month contracts, and they churn more than customers on one-year or two-year contracts.
plt.figure(figsize=(15,4))
p=sns.countplot(x='PaymentMethod',hue='Churn', data=telcom)
plt.title ('Churn by payment method', fontweight='bold')
Customers who pay by electronic check seem to churn more than customers who pay by mailed check, bank transfer or credit card. Mailed check, bank transfer and credit card customers seem to churn at about the same rate.
plt.figure(figsize=(15,4))
ax = sns.kdeplot(telcom.loc[(telcom['Churn']=='No'),'MonthlyCharges'],shade=True,label='No Churn')
ax = sns.kdeplot(telcom.loc[(telcom['Churn']=='Yes'),'MonthlyCharges'],shade=True,label='Churn')
ax.set(xlabel='Customer Monthly Charges',ylabel='Density')
plt.title('Customer Monthly Charges - Churn vs No Churn', fontweight='bold');
Customers who are charged less than 40 a month seem to churn less. As the monthly charge increases, churn becomes more common; the customers who churn the most pay between 70 and 100 a month.
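The density plot suggests this pattern; a churn rate per charge band puts numbers on it. The sketch below uses made-up rows and hypothetical band edges (0, 40, 70, 100), chosen only to mirror the ranges mentioned above:

```python
import pandas as pd

# Stand-in data; not the real customer file.
toy = pd.DataFrame({
    "MonthlyCharges": [20.0, 35.0, 55.0, 75.0, 80.0, 95.0],
    "Churn": ["No", "No", "No", "Yes", "No", "Yes"],
})

# Hypothetical monthly charge bands, each of the form (low, high].
bands = pd.cut(toy["MonthlyCharges"], bins=[0, 40, 70, 100],
               labels=["0-40", "40-70", "70-100"])

# The mean of a boolean series is the share of True values, i.e. the churn rate.
churn_rate = toy["Churn"].eq("Yes").groupby(bands, observed=True).mean().mul(100)
print(churn_rate.round(1))
```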
plt.figure(figsize=(15,4))
ax=sns.kdeplot(telcom.loc[(telcom['Churn']=='No'),'TotalCharges'], shade=True, label='No Churn')
ax=sns.kdeplot(telcom.loc[(telcom['Churn']=='Yes'),'TotalCharges'], shade=True, label='Churn')
ax.set(xlabel='Customer Total Charges',ylabel='Frequency')
plt.title('Customer Total Charges - Churn vs No Churn', fontweight='bold');
Customers whose total charges are below 1500 seem to churn more than customers with higher total charges.
plt.figure(figsize=(8,4))
p=sns.countplot(x='SeniorCitizen',hue='Churn', data=telcom)
plt.title ('Churn by SeniorCitizen', fontweight='bold')
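In absolute numbers, more non-senior citizens churn than senior citizens, because non-seniors make up about 84% of customers (the mean of SeniorCitizen is 0.162). The count plot alone does not show the churn rate within each group.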
telcom.head()
| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | ... | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | tenure_group |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | No | Yes | No | 1 | No | No phone service | DSL | No | Yes | ... | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No | Tenure_0-12 |
| 1 | Male | No | No | No | 34 | Yes | No | DSL | Yes | No | ... | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No | Tenure_24-48 |
| 2 | Male | No | No | No | 2 | Yes | No | DSL | Yes | Yes | ... | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | Tenure_0-12 |
| 3 | Male | No | No | No | 45 | No | No phone service | DSL | Yes | No | ... | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No | Tenure_24-48 |
| 4 | Female | No | No | No | 2 | Yes | No | Fiber optic | No | No | ... | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes | Tenure_0-12 |
5 rows × 21 columns
An outlier is a value that lies at an abnormally large distance from the other values in the dataset. It can be much smaller or much larger; basically, it does not follow the same pattern as the other values. We will use the interquartile range (IQR) to detect outliers. The interquartile range is the range between the first quartile (Q1) and the third quartile (Q3). With this approach, any value greater than Q3 + 1.5 × IQR or less than Q1 - 1.5 × IQR is considered an outlier. We will check for outliers in tenure, MonthlyCharges and TotalCharges.
def percent_outlier(telcom):
    # Returns the (lower, upper) 1.5*IQR bounds for the values passed in
    Q1 = np.percentile(telcom,25)
    Q3 = np.percentile(telcom,75)
    IQR = Q3-Q1
    lower_bound = Q1-(IQR*1.5)
    upper_bound = Q3+(IQR*1.5)
    return (lower_bound,upper_bound)
We are going to draw the boxplot for the tenure column and get the outlier list.
ax = sns.boxplot(y='tenure',x='Churn',data=telcom)
ax.set_title('Tenure box plot by Churn')
lowerbound,upperbound=percent_outlier(telcom.tenure)
tenureout=[x for x in telcom.tenure if (x<lowerbound) or (x>upperbound)]
tenureout
print ("All tenure value less than {0} and more than {1} are considered outliers".format(lowerbound,upperbound))
print("The min tenure is ",min(telcom.tenure))
print("The max tenure is ",max(telcom.tenure))
All tenure value less than -60.0 and more than 124.0 are considered outliers
The min tenure is 1
The max tenure is 72
lowerbound,upperbound=percent_outlier(telcom.MonthlyCharges)
#collect the monthly charges (not the tenure values) that fall outside the bounds
monthlyout=[x for x in telcom.MonthlyCharges if (x<lowerbound) or (x>upperbound)]
monthlyout
print ("All monthly charges less than {0} and more than {1} are considered outliers".format(lowerbound,upperbound))
print("The min monthly charges is ",min(telcom.MonthlyCharges))
print("The max monthly charges is ",max(telcom.MonthlyCharges))
All monthly charges less than -45.824999999999996 and more than 171.27499999999998 are considered outliers
The min monthly charges is 18.25
The max monthly charges is 118.75
lowerbound,upperbound=percent_outlier(telcom.TotalCharges)
#collect the total charges (not the tenure values) that fall outside the bounds
totalout=[x for x in telcom.TotalCharges if (x<lowerbound) or (x>upperbound)]
totalout
print ("All Total Charges less than {0} and more than {1} are considered outliers".format(lowerbound,upperbound))
print("The min Total Charges is ",min(telcom.TotalCharges))
print("The max Total Charges is ",max(telcom.TotalCharges))
All Total Charges less than -4688.481250000001 and more than 8884.66875 are considered outliers
The min Total Charges is 18.8
The max Total Charges is 8684.8
Based on the method used here, all values fall within the computed bounds. Therefore, by the 1.5 × IQR rule, tenure, MonthlyCharges and TotalCharges contain no outliers.
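The same rule can be applied without a Python loop by building a boolean mask. This is a minimal sketch; the helper name `iqr_outliers` and the sample values are assumptions for illustration and are not part of the notebook:

```python
import pandas as pd

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR] and the bounds."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return values[mask], (lower, upper)

# Made-up sample with one obviously extreme value.
sample = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])
flagged, (lower, upper) = iqr_outliers(sample)
print(flagged.tolist(), lower, upper)
```

Applied to a column such as tenure, an empty result would match the conclusion above.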
#function for pie plot for customer attrition types
def plot_pie(column) :
    # Left donut: churned customers
    trace1 = go.Pie(values = churn[column].value_counts().values.tolist(),
                    labels = churn[column].value_counts().keys().tolist(),
                    hoverinfo = "label+percent+name",
                    domain = dict(x = [0,.48]),
                    name = "Churn Customers",
                    marker = dict(line = dict(width = 2,
                                              color = "rgb(243,243,243)")
                                  ),
                    hole = .6
                    )
    # Right donut: customers who did not churn
    trace2 = go.Pie(values = not_churn[column].value_counts().values.tolist(),
                    labels = not_churn[column].value_counts().keys().tolist(),
                    hoverinfo = "label+percent+name",
                    marker = dict(line = dict(width = 2,
                                              color = "rgb(243,243,243)")
                                  ),
                    domain = dict(x = [.52,1]),
                    hole = .6,
                    name = "Non churn customers"
                    )
    layout = go.Layout(dict(title = column + " distribution in customer attrition ",
                            plot_bgcolor = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            annotations = [dict(text = "churn customers",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .15, y = .5),
                                           dict(text = "Non churn customers",
                                                font = dict(size = 13),
                                                showarrow = False,
                                                x = .88,y = .5
                                                )
                                           ]
                            )
                       )
    data = [trace1,trace2]
    fig = go.Figure(data = data,layout = layout)
    py.iplot(fig)
#function for histogram for customer attrition types
def histogram(column) :
    trace1 = go.Histogram(x = churn[column],
                          histnorm= "percent",
                          name = "Churn Customers",
                          marker = dict(line = dict(width = .5,
                                                    color = "black"
                                                    )
                                        ),
                          opacity = .9
                          )
    trace2 = go.Histogram(x = not_churn[column],
                          histnorm = "percent",
                          name = "Non churn customers",
                          marker = dict(line = dict(width = .5,
                                                    color = "black"
                                                    )
                                        ),
                          opacity = .9
                          )
    data = [trace1,trace2]
    layout = go.Layout(dict(title =column + " distribution in customer attrition ",
                            plot_bgcolor = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                         title = column,
                                         zerolinewidth=1,
                                         ticklen=5,
                                         gridwidth=2
                                         ),
                            yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                         title = "percent",
                                         zerolinewidth=1,
                                         ticklen=5,
                                         gridwidth=2
                                         ),
                            )
                       )
    fig = go.Figure(data=data,layout=layout)
    py.iplot(fig)
#function for scatter plot matrix for numerical columns in data
def scatter_matrix(df) :
    df = df.sort_values(by = "Churn" ,ascending = True)
    classes = df["Churn"].unique().tolist()
    # Map each churn class to an integer code used for colouring
    class_code = {classes[k] : k for k in range(2)}
    color_vals = [class_code[cl] for cl in df["Churn"]]
    pl_colorscale = "Portland"
    # Hover text taken in the sorted row order so it lines up with the plotted points
    text = df["Churn"].tolist()
    trace = go.Splom(dimensions = [dict(label = "tenure",
                                        values = df["tenure"]),
                                   dict(label = 'MonthlyCharges',
                                        values = df['MonthlyCharges']),
                                   dict(label = 'TotalCharges',
                                        values = df['TotalCharges'])],
                     text = text,
                     marker = dict(color = color_vals,
                                   colorscale = pl_colorscale,
                                   size = 3,
                                   showscale = False,
                                   line = dict(width = .1,
                                               color='rgb(230,230,230)'
                                               )
                                   )
                     )
    axis = dict(showline = True,
                zeroline = False,
                gridcolor = "#fff",
                ticklen = 4
                )
    layout = go.Layout(dict(title =
                            "Scatter plot matrix for Numerical columns for customer attrition",
                            autosize = False,
                            height = 800,
                            width = 800,
                            dragmode = "select",
                            hovermode = "closest",
                            plot_bgcolor = 'rgba(240,240,240, 0.95)',
                            xaxis1 = dict(axis),
                            yaxis1 = dict(axis),
                            xaxis2 = dict(axis),
                            yaxis2 = dict(axis),
                            xaxis3 = dict(axis),
                            yaxis3 = dict(axis),
                            )
                       )
    data = [trace]
    fig = go.Figure(data = data,layout = layout )
    py.iplot(fig)
#for all categorical columns plot pie
for i in cat_cols :
    plot_pie(i)
#for all numerical columns plot histogram
for i in num_cols :
    histogram(i)
#scatter plot matrix
scatter_matrix(telcom)
#Separating columns to be visualized
out_cols = list(set(telcom.nunique()[telcom.nunique()<6].keys().tolist()
+ telcom.select_dtypes(include='object').columns.tolist()))
viz_cols = [x for x in telcom.columns if x not in out_cols] + ['Churn']
sns.pairplot(telcom[viz_cols], diag_kind="kde")
plt.show()
The numerical columns in this dataset are tenure, MonthlyCharges and TotalCharges, and they are strongly related. TotalCharges is roughly tenure × MonthlyCharges (for example, 34 × 56.95 ≈ 1936 against a recorded 1889.50), so it carries much of the same information as the other two. We may not need to keep all three as model inputs.
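A correlation matrix is a quick way to quantify how strongly numeric columns move together. The sketch below uses made-up rows in which TotalCharges is built as exactly tenure × MonthlyCharges, which is an assumption for illustration; in the real data the relationship is only approximate:

```python
import pandas as pd

# Stand-in data; TotalCharges is constructed here, not observed.
toy = pd.DataFrame({
    "tenure": [1, 10, 20, 40, 60],
    "MonthlyCharges": [30.0, 50.0, 70.0, 60.0, 90.0],
})
toy["TotalCharges"] = toy["tenure"] * toy["MonthlyCharges"]

# Pearson correlation between every pair of numeric columns.
corr = toy.corr()
print(corr.round(2))
```

On the real data, a high correlation between two inputs is a hint that one of them adds little on its own.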
#customer attrition in tenure groups
tg_ch = churn["tenure_group"].value_counts().reset_index()
tg_ch.columns = ["tenure_group","count"]
tg_nch = not_churn["tenure_group"].value_counts().reset_index()
tg_nch.columns = ["tenure_group","count"]
#bar - churn
trace1 = go.Bar(x = tg_ch["tenure_group"] , y = tg_ch["count"],
name = "Churn Customers",
marker = dict(line = dict(width = .5,color = "black")),
opacity = .9)
#bar - not churn
trace2 = go.Bar(x = tg_nch["tenure_group"] , y = tg_nch["count"],
name = "Non Churn Customers",
marker = dict(line = dict(width = .5,color = "black")),
opacity = .9)
layout = go.Layout(dict(title = "Customer attrition in tenure groups",
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
title = "tenure group",
zerolinewidth=1,ticklen=5,gridwidth=2),
yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
title = "count",
zerolinewidth=1,ticklen=5,gridwidth=2),
)
)
data = [trace1,trace2]
fig = go.Figure(data=data,layout=layout)
py.iplot(fig)
telcom[['MonthlyCharges', 'TotalCharges','tenure',"tenure_group"]]
#scatter plot monthly charges & total charges by tenure group
def plot_tenure_scatter(tenure_group,color) :
    tracer = go.Scatter(x = telcom[telcom["tenure_group"] == tenure_group]["MonthlyCharges"],
                        y = telcom[telcom["tenure_group"] == tenure_group]["TotalCharges"],
                        mode = "markers",marker = dict(line = dict(color = "black",
                                                                   width = .2),
                                                       size = 4 , color = color,
                                                       symbol = "diamond-dot",
                                                       ),
                        name = tenure_group,
                        opacity = .9
                        )
    return tracer
#scatter plot monthly charges & total charges by churn group
def plot_churncharges_scatter(churn,color) :
    tracer = go.Scatter(x = telcom[telcom["Churn"] == churn]["MonthlyCharges"],
                        y = telcom[telcom["Churn"] == churn]["TotalCharges"],
                        mode = "markers",marker = dict(line = dict(color = "black",
                                                                   width = .2),
                                                       size = 4 , color = color,
                                                       symbol = "diamond-dot",
                                                       ),
                        name = "Churn - " + churn,
                        opacity = .9
                        )
    return tracer
trace1 = plot_tenure_scatter("Tenure_0-12","#FF3300")
trace2 = plot_tenure_scatter("Tenure_12-24","#6666FF")
trace3 = plot_tenure_scatter("Tenure_24-48","#99FF00")
trace4 = plot_tenure_scatter("Tenure_48-60","#996600")
trace5 = plot_tenure_scatter("Tenure_gt_60","grey")
trace6 = plot_churncharges_scatter("Yes","red")
trace7 = plot_churncharges_scatter("No","blue")
data1 = [trace1,trace2,trace3,trace4,trace5]
data2 = [trace7,trace6]
#layout
def layout_title(title) :
    layout = go.Layout(dict(title = title,
                            plot_bgcolor = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                         title = "Monthly charges",
                                         zerolinewidth=1,ticklen=5,gridwidth=2),
                            yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
                                         title = "Total Charges",
                                         zerolinewidth=1,ticklen=5,gridwidth=2),
                            height = 600
                            )
                       )
    return layout
layout1 = layout_title("Monthly Charges & Total Charges by Tenure group")
layout2 = layout_title("Monthly Charges & Total Charges by Churn group")
fig1 = go.Figure(data = data1,layout = layout1)
fig2 = go.Figure(data = data2,layout = layout2)
py.iplot(fig1)
py.iplot(fig2)
avg_tgc = telcom.groupby(["tenure_group","Churn"])[["MonthlyCharges",
"TotalCharges"]].mean().reset_index()
#function for tracing
def mean_charges(column,aggregate) :
    tracer = go.Bar(x = avg_tgc[avg_tgc["Churn"] == aggregate]["tenure_group"],
                    y = avg_tgc[avg_tgc["Churn"] == aggregate][column],
                    name = aggregate,marker = dict(line = dict(width = 1)),
                    text = "Churn"
                    )
    return tracer
#function for layout
def layout_plot(title,xaxis_lab,yaxis_lab) :
    layout = go.Layout(dict(title = title,
                            plot_bgcolor = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(gridcolor = 'rgb(255, 255, 255)',title = xaxis_lab,
                                         zerolinewidth=1,ticklen=5,gridwidth=2),
                            yaxis = dict(gridcolor = 'rgb(255, 255, 255)',title = yaxis_lab,
                                         zerolinewidth=1,ticklen=5,gridwidth=2),
                            )
                       )
    return layout
#plot1 - mean monthly charges by tenure groups
trace1 = mean_charges("MonthlyCharges","Yes")
trace2 = mean_charges("MonthlyCharges","No")
layout1 = layout_plot("Average Monthly Charges by Tenure groups",
"Tenure group","Monthly Charges")
data1 = [trace1,trace2]
fig1 = go.Figure(data=data1,layout=layout1)
#plot2 - mean total charges by tenure groups
trace3 = mean_charges("TotalCharges","Yes")
trace4 = mean_charges("TotalCharges","No")
layout2 = layout_plot("Average Total Charges by Tenure groups",
"Tenure group","Total Charges")
data2 = [trace3,trace4]
fig2 = go.Figure(data=data2,layout=layout2)
py.iplot(fig1)
py.iplot(fig2)
#copy data (tel_df keeps the tenure_group column)
tel_df = telcom.copy()
#Drop the derived tenure_group column
telcom = telcom.drop(columns = "tenure_group")
trace1 = go.Scatter3d(x = churn["MonthlyCharges"],
y = churn["TotalCharges"],
z = churn["tenure"],
mode = "markers",
name = "Churn customers",
#text = "Id : " + churn["customerID"],
marker = dict(size = 1,color = "red")
)
trace2 = go.Scatter3d(x = not_churn["MonthlyCharges"],
y = not_churn["TotalCharges"],
z = not_churn["tenure"],
name = "Non churn customers",
#text = "Id : " + not_churn["customerID"],
mode = "markers",
marker = dict(size = 1,color= "green")
)
layout = go.Layout(dict(title = "Monthly Charges,Total Charges & Tenure in Customer Attrition",
scene = dict(camera = dict(up=dict(x= 0 , y=0, z=0),
center=dict(x=0, y=0, z=0),
eye=dict(x=1.25, y=1.25, z=1.25)),
xaxis = dict(title = "Monthly Charges",
gridcolor='rgb(255, 255, 255)',
zerolinecolor='rgb(255, 255, 255)',
showbackground=True,
backgroundcolor='rgb(230, 230,230)'),
yaxis = dict(title = "Total Charges",
gridcolor='rgb(255, 255, 255)',
zerolinecolor='rgb(255, 255, 255)',
showbackground=True,
backgroundcolor='rgb(230, 230,230)'
),
zaxis = dict(title = "Tenure",
gridcolor='rgb(255, 255, 255)',
zerolinecolor='rgb(255, 255, 255)',
showbackground=True,
backgroundcolor='rgb(230, 230,230)'
)
),
height = 700,
)
)
data = [trace1,trace2]
fig = go.Figure(data = data,layout = layout)
py.iplot(fig)
telcom.head()
| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | No | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | No | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.50 | No |
| 2 | Male | No | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | No | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | Female | No | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
#customer id col
Id_col = ['customerID']
#Target columns
target_col = ["Churn"]
#categorical columns
cat_cols = telcom.nunique()[telcom.nunique() < 6].keys().tolist()
cat_cols = [x for x in cat_cols if x not in target_col]
#numerical columns
num_cols = [x for x in telcom.columns if x not in cat_cols + target_col + Id_col]
#Binary columns with 2 values
bin_cols = telcom.nunique()[telcom.nunique() == 2].keys().tolist()
#Columns more than 2 values
multi_cols = [i for i in cat_cols if i not in bin_cols]
#Label encoding binary columns (classes are sorted, e.g. No -> 0, Yes -> 1)
le = LabelEncoder()
for i in bin_cols :
    telcom[i] = le.fit_transform(telcom[i])
#One-hot encoding columns with more than 2 values
telcom = pd.get_dummies(data = telcom,columns = multi_cols)
#Scaling Numerical columns
std = StandardScaler()
scaled = std.fit_transform(telcom[num_cols])
scaled = pd.DataFrame(scaled,columns=num_cols)
#dropping original values merging scaled values for numerical columns
df_telcom_og = telcom.copy()
telcom = telcom.drop(columns = num_cols,axis = 1)
telcom = telcom.merge(scaled,left_index=True,right_index=True,how = "left")
telcom.head()
| | gender | SeniorCitizen | Partner | Dependents | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | ... | Contract_Month-to-month | Contract_One year | Contract_Two year | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | -1.280248 | -1.161694 | -0.994194 |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0.064303 | -0.260878 | -0.173740 |
| 2 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | -1.239504 | -0.363923 | -0.959649 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0.512486 | -0.747850 | -0.195248 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | -1.239504 | 0.196178 | -0.940457 |
5 rows × 29 columns
summary = (df_telcom_og[[i for i in df_telcom_og.columns if i not in Id_col]].
describe().transpose().reset_index())
summary = summary.rename(columns = {"index" : "feature"})
summary = np.around(summary,3)
val_lst = [summary['feature'], summary['count'],
summary['mean'],summary['std'],
summary['min'], summary['25%'],
summary['50%'], summary['75%'], summary['max']]
trace = go.Table(header = dict(values = summary.columns.tolist(),
line = dict(color = ['#506784']),
fill = dict(color = ['#119DFF']),
),
cells = dict(values = val_lst,
line = dict(color = ['#506784']),
fill = dict(color = ["lightgrey",'#F5F8FF'])
),
columnwidth = [200,60,100,100,60,60,80,80,80])
layout = go.Layout(dict(title = "Variable Summary"))
figure = go.Figure(data=[trace],layout=layout)
py.iplot(figure)
# Bar plot of the correlation of Churn with the remaining features
plt.figure(figsize=(16,10))
telcom.corr()['Churn'].sort_values(ascending=False).plot(kind='bar',figsize=(20,5))
#correlation
correlation = telcom.corr()
#tick labels
matrix_cols = correlation.columns.tolist()
#convert to array
corr_array = np.array(correlation)
#Plotting
trace = go.Heatmap(z = corr_array,
x = matrix_cols,
y = matrix_cols,
colorscale = "Viridis",
colorbar = dict(title = "Pearson Correlation coefficient",
titleside = "right"
) ,
)
layout = go.Layout(dict(title = "Correlation Matrix for variables",
autosize = False,
height = 1020,
width = 950,
margin = dict(r = 0 ,l = 210,
t = 25,b = 210,
),
yaxis = dict(tickfont = dict(size = 9)),
xaxis = dict(tickfont = dict(size = 9))
)
)
data = [trace]
fig = go.Figure(data=data,layout=layout)
py.iplot(fig)
Inference: 'phone service' and 'multiple lines' are correlated because a customer without phone service cannot have multiple lines. Knowing that a customer is not subscribed to phone service therefore tells us that the customer does not have multiple lines. Similarly, 'internet service' is correlated with 'online security', 'online backup', 'device protection', 'streaming tv' and 'streaming movies'.
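The dependency described above can be checked on a small made-up frame. This is only a sketch: the rows are invented for illustration and only the column names and category labels follow the dataset. The 'No phone service' dummy is the exact complement of PhoneService, so their Pearson correlation is -1.

```python
import pandas as pd

# Toy rows (invented for illustration) using the dataset's category labels
toy = pd.DataFrame({
    "PhoneService":  ["Yes", "Yes", "No", "Yes", "No", "Yes"],
    "MultipleLines": ["No", "Yes", "No phone service", "Yes",
                      "No phone service", "No"],
})

# One-hot encode the 3-level column, then encode the binary column as 0/1
encoded = pd.get_dummies(toy, columns=["MultipleLines"], dtype=int)
encoded["PhoneService"] = (encoded["PhoneService"] == "Yes").astype(int)

# Correlation between PhoneService and the "No phone service" dummy
r = encoded.corr().loc["PhoneService", "MultipleLines_No phone service"]
print(round(r, 3))
```

A perfectly correlated pair like this carries no extra information, which is one reason to consider dropping one of the two columns before fitting a linear model.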
#separating binary columns
bi_cs = telcom.nunique()[telcom.nunique() == 2].keys()
dat_rad = telcom[bi_cs]
#plotting radar chart for churn and non churn customers(binary variables)
def plot_radar(df,aggregate,title) :
    data_frame = df[df["Churn"] == aggregate]
    data_frame_x = data_frame[bi_cs].sum().reset_index()
    data_frame_x.columns = ["feature","yes"]
    data_frame_x["no"] = data_frame.shape[0] - data_frame_x["yes"]
    data_frame_x = data_frame_x[data_frame_x["feature"] != "Churn"]
    #count of 1's(yes)
    trace1 = go.Scatterpolar(r = data_frame_x["yes"].values.tolist(),
                             theta = data_frame_x["feature"].tolist(),
                             fill = "toself",name = "count of 1's",
                             mode = "markers+lines",
                             marker = dict(size = 5)
                            )
    #count of 0's(No)
    trace2 = go.Scatterpolar(r = data_frame_x["no"].values.tolist(),
                             theta = data_frame_x["feature"].tolist(),
                             fill = "toself",name = "count of 0's",
                             mode = "markers+lines",
                             marker = dict(size = 5)
                            )
    layout = go.Layout(dict(polar = dict(radialaxis = dict(visible = True,
                                                           side = "counterclockwise",
                                                           showline = True,
                                                           linewidth = 2,
                                                           tickwidth = 2,
                                                           gridcolor = "white",
                                                           gridwidth = 2),
                                         angularaxis = dict(tickfont = dict(size = 10),
                                                            layer = "below traces"
                                                           ),
                                         bgcolor = "rgb(243,243,243)",
                                        ),
                            paper_bgcolor = "rgb(243,243,243)",
                            title = title,height = 700))
    data = [trace2,trace1]
    fig = go.Figure(data=data,layout=layout)
    py.iplot(fig)
#plot
plot_radar(dat_rad,1,"Churn - Customers")
plot_radar(dat_rad,0,"Non Churn - Customers")
Data needs to be encoded numerically before applying machine learning models. Feature engineering: in this section, we will find the features that are most predictive for our model. Before proceeding to feature engineering, we map all string booleans to numeric booleans.
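A minimal sketch of the string-to-numeric mapping, on invented rows. The explicit dictionary is an illustrative alternative, not the notebook's own code; the notebook uses `LabelEncoder`, which sorts the class labels and so produces the same codes for Yes/No columns.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows (invented for illustration)
toy = pd.DataFrame({"Partner": ["Yes", "No", "No"],
                    "Churn":   ["No", "Yes", "No"]})

# Explicit mapping: the meaning of 0 and 1 is visible in the code
yes_no = {"No": 0, "Yes": 1}
mapped = toy.apply(lambda col: col.map(yes_no))

# LabelEncoder sorts the labels, so "No" -> 0 and "Yes" -> 1 as well
le = LabelEncoder()
encoded_churn = le.fit_transform(toy["Churn"])

print(mapped)
print(list(le.classes_), encoded_churn.tolist())
```

With `LabelEncoder` the code assigned to each label depends on alphabetical order (for gender, Female becomes 0 and Male becomes 1), so it is worth checking `classes_` before interpreting coefficients.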
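#predictors: every column except the target (iloc[:, :-1] would drop TotalCharges and keep Churn)
x = telcom.drop(columns = "Churn")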
y = telcom['Churn']
categorical_columns = list(x.select_dtypes(include='category').columns)
numeric_columns = list(x.select_dtypes(exclude='category').columns)
categorical_columns
[]
pd.set_option('display.max_columns', None)
telcom.head()
| | gender | SeniorCitizen | Partner | Dependents | PhoneService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | PaperlessBilling | Churn | MultipleLines_No | MultipleLines_No phone service | MultipleLines_Yes | InternetService_DSL | InternetService_Fiber optic | InternetService_No | Contract_Month-to-month | Contract_One year | Contract_Two year | PaymentMethod_Bank transfer (automatic) | PaymentMethod_Credit card (automatic) | PaymentMethod_Electronic check | PaymentMethod_Mailed check | tenure | MonthlyCharges | TotalCharges |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | -1.280248 | -1.161694 | -0.994194 |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0.064303 | -0.260878 | -0.173740 |
| 2 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | -1.239504 | -0.363923 | -0.959649 |
| 3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0.512486 | -0.747850 | -0.195248 |
| 4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | -1.239504 | 0.196178 | -0.940457 |
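For the other categorical features, we use one-hot (dummy) encoding to transform them to binary columns. One-hot encoding creates a dummy feature for each unique value of a nominal feature and assigns 1 if the row has that value and 0 otherwise. As the table above shows, pd.get_dummies was called without drop_first, so a variable with n categories produces n dummy columns here (for example Contract_Month-to-month, Contract_One year and Contract_Two year); the n-1 variant drops one of them as a reference level.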
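The difference between n and n-1 dummy columns can be seen on a tiny invented frame. This is a sketch, not the notebook's code: only the `Contract` labels are taken from the dataset, and `drop_first=True` is shown as an option the notebook does not use.

```python
import pandas as pd

# Toy rows (invented for illustration)
toy = pd.DataFrame({"Contract": ["Month-to-month", "One year",
                                 "Two year", "One year"]})

# Default: one dummy column per category (n columns)
full = pd.get_dummies(toy, columns=["Contract"], dtype=int)

# drop_first=True: the first category becomes the reference level (n-1 columns)
reduced = pd.get_dummies(toy, columns=["Contract"], drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Keeping all n columns is harmless for tree models, while dropping one avoids perfectly collinear inputs in an unregularised linear model.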
In this section, we build our first models. We train several machine learning algorithms as baselines using all features, then select the one that performs well and tune it for better accuracy.
This is a classification problem: we want to predict whether or not a customer will churn. The classifiers explored below include logistic regression, a decision tree, and k-nearest neighbours.
We keep 70% of the data for training and 30% for testing. Based on our analysis above, about 73% of customers did not churn and about 27% did, so the classes are somewhat imbalanced. We pass the argument stratify=y to train_test_split so that both the training and test sets have the same class proportions as the original dataset.
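The effect of `stratify` can be illustrated on synthetic data. This is a sketch under stated assumptions: the 1000 rows and random features are invented, and only the 73/27 class ratio and the 70/30 split mirror the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (invented): 27% positives, 73% negatives
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 270 + [0] * 730)

# Stratified 70/30 split keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

print("train churn rate:", y_tr.mean())
print("test churn rate: ", y_te.mean())
```

Without `stratify`, the two rates would differ by sampling noise, which matters more as the minority class gets smaller.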
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,accuracy_score,classification_report
from sklearn.metrics import roc_auc_score,roc_curve
from sklearn.metrics import f1_score
import statsmodels.api as sm
from sklearn.metrics import precision_score,recall_score
from yellowbrick.classifier import DiscriminationThreshold
#splitting train and test data
train,test = train_test_split(telcom,test_size = .30 ,random_state = 42,stratify=y)
#separating dependent and independent variables
cols = [i for i in telcom.columns if i not in Id_col + target_col]
train_X = train[cols]
train_Y = train[target_col]
test_X = test[cols]
test_Y = test[target_col]
#Function attributes
#algorithm - model to fit
#training_x - predictor variables dataframe(training)
#testing_x - predictor variables dataframe(testing)
#training_y - target variable(training)
#testing_y - target variable(testing)
#cols - names of the predictor columns
#cf - "coefficients" for logistic regression, "features" for tree based
#models, or the string 'None' to skip the importance plot
#threshold_plot - if True draws the discrimination threshold plot for the model
def telecom_churn_prediction(algorithm,training_x,testing_x,
                             training_y,testing_y,cols,cf,threshold_plot) :
    #model
    algorithm.fit(training_x,training_y)
    predictions = algorithm.predict(testing_x)
    probabilities = algorithm.predict_proba(testing_x)
    #confusion matrix
    conf_matrix = confusion_matrix(testing_y,predictions)
    #roc_auc_score computed from the hard 0/1 predictions, i.e. the mean of the
    #two class recalls; pass probabilities[:,1] for the threshold-free AUC
    model_roc_auc = roc_auc_score(testing_y,predictions)
    print ("Area under curve : ",model_roc_auc,"\n")
    fpr,tpr,thresholds = roc_curve(testing_y,probabilities[:,1])
    #plot confusion matrix
    trace1 = go.Heatmap(z = conf_matrix ,
                        x = ["Not churn","Churn"],
                        y = ["Not churn","Churn"],
                        showscale = False,colorscale = "Picnic",
                        name = "matrix")
    #plot roc curve
    trace2 = go.Scatter(x = fpr,y = tpr,
                        name = "Roc : " + str(model_roc_auc),
                        line = dict(color = ('rgb(22, 96, 167)'),width = 2))
    trace3 = go.Scatter(x = [0,1],y=[0,1],
                        line = dict(color = ('rgb(205, 12, 24)'),width = 2,
                                    dash = 'dot'))
    #coeffs
    if cf in ['coefficients', 'features']:
        if cf == "coefficients" :
            coefficients = pd.DataFrame(algorithm.coef_.ravel())
        elif cf == "features" :
            coefficients = pd.DataFrame(algorithm.feature_importances_)
        column_df = pd.DataFrame(cols)
        coef_sumry = (pd.merge(coefficients,column_df,left_index= True,
                               right_index= True, how = "left"))
        coef_sumry.columns = ["coefficients","features"]
        coef_sumry = coef_sumry.sort_values(by = "coefficients",ascending = False)
        #plot coeffs
        trace4 = go.Bar(x = coef_sumry["features"],y = coef_sumry["coefficients"],
                        name = "coefficients",
                        marker = dict(color = coef_sumry["coefficients"],
                                      colorscale = "Picnic",
                                      line = dict(width = .6,color = "black")))
        #subplots
        fig = tls.make_subplots(rows=2, cols=2, specs=[[{}, {}], [{'colspan': 2}, None]],
                                subplot_titles=('Confusion Matrix',
                                                'Receiver operating characteristic',
                                                'Feature Importances'))
        fig.append_trace(trace1,1,1)
        fig.append_trace(trace2,1,2)
        fig.append_trace(trace3,1,2)
        fig.append_trace(trace4,2,1)
        fig['layout'].update(showlegend=False, title="Model performance" ,
                             autosize = False,height = 900,width = 800,
                             plot_bgcolor = 'rgba(240,240,240, 0.95)',
                             paper_bgcolor = 'rgba(240,240,240, 0.95)',
                             margin = dict(b = 195))
        fig["layout"]["xaxis2"].update(dict(title = "false positive rate"))
        fig["layout"]["yaxis2"].update(dict(title = "true positive rate"))
        fig["layout"]["xaxis3"].update(dict(showgrid = True,tickfont = dict(size = 10),
                                            tickangle = 90))
    elif cf == 'None':
        #subplots
        fig = make_subplots(rows=1, cols=2,
                            subplot_titles=('Confusion matrix',
                                            'Receiver operating characteristic')
                           )
        fig.append_trace(trace1,1,1)
        fig.append_trace(trace2,1,2)
        fig.append_trace(trace3,1,2)
        fig['layout'].update(showlegend=False, title="Model performance",
                             autosize=False, height = 500, width = 800,
                             plot_bgcolor = 'rgba(240,240,240,0.95)',
                             paper_bgcolor = 'rgba(240,240,240,0.95)',
                             margin = dict(b = 195))
        fig["layout"]["xaxis2"].update(dict(title = "false positive rate"))
        fig["layout"]["yaxis2"].update(dict(title = "true positive rate"))
    #print the model, report and accuracy for both branches above
    print (algorithm)
    print ("\n Classification report : \n",classification_report(testing_y,predictions))
    print ("Accuracy Score : ",accuracy_score(testing_y,predictions))
    py.iplot(fig)
    if threshold_plot :
        visualizer = DiscriminationThreshold(algorithm)
        visualizer.fit(training_x,training_y)
        visualizer.poof()
logit = LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
telecom_churn_prediction(logit,train_X,test_X,train_Y,test_Y,
cols,"coefficients",threshold_plot = True)
Area under curve : 0.7253060740699825
LogisticRegression(multi_class='ovr', n_jobs=1, solver='liblinear')
Classification report :
precision recall f1-score support
0 0.85 0.89 0.87 1549
1 0.65 0.56 0.60 561
accuracy 0.80 2110
macro avg 0.75 0.73 0.74 2110
weighted avg 0.80 0.80 0.80 2110
Accuracy Score : 0.8028436018957346
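The "Area under curve" value printed above comes from `roc_auc_score(testing_y, predictions)`, which is given hard 0/1 labels. In that case the score reduces to the mean of the two class recalls (here roughly (0.89 + 0.56) / 2), which is the balanced accuracy at the 0.5 threshold rather than a threshold-free AUC. The sketch below shows the difference on invented labels and probabilities; none of the numbers come from the notebook.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, balanced_accuracy_score

# Invented labels and predicted churn probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
proba  = np.array([0.1, 0.2, 0.3, 0.45, 0.4, 0.6, 0.8, 0.55, 0.35, 0.05])
labels = (proba >= 0.5).astype(int)   # hard predictions at the 0.5 threshold

auc_from_labels = roc_auc_score(y_true, labels)   # what the helper prints
auc_from_proba  = roc_auc_score(y_true, proba)    # threshold-free ROC AUC
bal_acc         = balanced_accuracy_score(y_true, labels)

print("AUC from hard labels  :", auc_from_labels)
print("balanced accuracy     :", bal_acc)
print("AUC from probabilities:", auc_from_proba)
```

To report the usual ROC AUC, the probabilities of the positive class (`probabilities[:,1]` in the helper) would be passed instead of `predictions`.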
print(x.select_dtypes(include='category').columns)
Index([], dtype='object')
from imblearn.over_sampling import SMOTE
cols = [i for i in telcom.columns if i not in Id_col+target_col]
smote_X = telcom[cols]
smote_Y = telcom[target_col]
#Split train and test data
smote_train_X,smote_test_X,smote_train_Y,smote_test_Y = train_test_split(smote_X,smote_Y,
test_size = .25 ,
random_state = 111,stratify=y)
#oversampling minority class using smote
os = SMOTE(random_state = 0)
os_smote_X,os_smote_Y = os.fit_resample(smote_train_X,smote_train_Y)
os_smote_X = pd.DataFrame(data = os_smote_X,columns=cols)
os_smote_Y = pd.DataFrame(data = os_smote_Y,columns=target_col)
###
logit_smote = LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
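#note: evaluated on test_X/test_Y from the earlier 70/30 split, not on
#smote_test_X/smote_test_Y; the two splits use different seeds, so some
#test rows can also be present in the oversampled training data
telecom_churn_prediction(logit_smote,os_smote_X,test_X,os_smote_Y,test_Y,
                         cols,"coefficients",threshold_plot = True)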
Area under curve : 0.7470180865350424
LogisticRegression(multi_class='ovr', n_jobs=1, solver='liblinear')
Classification report :
precision recall f1-score support
0 0.88 0.77 0.82 1549
1 0.53 0.72 0.61 561
accuracy 0.76 2110
macro avg 0.71 0.75 0.72 2110
weighted avg 0.79 0.76 0.77 2110
Accuracy Score : 0.7587677725118483
Recursive Feature Elimination (RFE) is based on the idea of repeatedly constructing a model, choosing either the best or the worst performing feature, setting that feature aside, and repeating the process with the remaining features. This is applied until all features in the dataset are exhausted or the requested number of features remains. The goal of RFE is to select features by recursively considering smaller and smaller sets of features.
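A minimal sketch of RFE on synthetic data, assuming scikit-learn's `RFE` with a logistic regression as the ranking model. The dataset from `make_classification` and the choice of 3 features are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, 3 of them informative
X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           n_redundant=0, random_state=0)

# Drop the weakest feature (smallest |coefficient|) one at a time
# until 3 features remain
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)

print(selector.support_)   # True for the 3 kept features
print(selector.ranking_)   # 1 = kept; larger = eliminated earlier
```

`ranking_` records the elimination order, so it gives a full ordering of the features even though only the top ones are kept.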
from sklearn.feature_selection import RFE
logit = LogisticRegression()
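#keep the 10 best features (keyword argument is required by recent scikit-learn versions)
rfe = RFE(logit,n_features_to_select = 10)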
rfe = rfe.fit(os_smote_X,os_smote_Y.values.ravel())
rfe.support_
rfe.ranking_
#identified columns Recursive Feature Elimination
idc_rfe = pd.DataFrame({"rfe_support" :rfe.support_,
"columns" : [i for i in telcom.columns if i not in Id_col + target_col],
"ranking" : rfe.ranking_,
})
cols = idc_rfe[idc_rfe["rfe_support"] == True]["columns"].tolist()
#separating train and test data
train_rf_X = os_smote_X[cols]
train_rf_Y = os_smote_Y
test_rf_X = test[cols]
test_rf_Y = test[target_col]
logit_rfe = LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=False, warm_start=False)
#applying model
telecom_churn_prediction(logit_rfe,train_rf_X,test_rf_X,train_rf_Y,test_rf_Y,
cols,"coefficients",threshold_plot = True)
tab_rk = ff.create_table(idc_rfe)
py.iplot(tab_rk)
Area under curve : 0.7267635148431109
LogisticRegression(multi_class='ovr', n_jobs=1, solver='liblinear',
verbose=False)
Classification report :
precision recall f1-score support
0 0.90 0.65 0.76 1549
1 0.45 0.80 0.58 561
accuracy 0.69 2110
macro avg 0.68 0.73 0.67 2110
weighted avg 0.78 0.69 0.71 2110
Accuracy Score : 0.6914691943127962
from sklearn.feature_selection import chi2
from sklearn.feature_selection import SelectKBest
#select columns
cols = [i for i in telcom.columns if i not in Id_col + target_col ]
#dataframe with non negative values
df_x = df_telcom_og[cols]
df_y = df_telcom_og[target_col]
#fit model with k= 3
select = SelectKBest(score_func = chi2,k = 3)
fit = select.fit(df_x,df_y)
#Summarize scores
print ("scores")
print (fit.scores_)
print ("P - Values")
print (fit.pvalues_)
#create dataframe
score = pd.DataFrame({"features":cols,"scores":fit.scores_,"p_values":fit.pvalues_ })
score = score.sort_values(by = "scores" ,ascending =False)
#creating new label for categorical and numerical columns
score["feature_type"] = np.where(score["features"].isin(num_cols),"Numerical","Categorical")
#plot
trace = go.Scatter(x = score[score["feature_type"] == "Categorical"]["features"],
                   y = score[score["feature_type"] == "Categorical"]["scores"],
                   name = "Categorical",mode = "lines+markers",
marker = dict(color = "red",
line = dict(width =1))
)
trace1 = go.Bar(x = score[score["feature_type"] == "Numerical"]["features"],
y = score[score["feature_type"] == "Numerical"]["scores"],name = "Numerical",
marker = dict(color = "royalblue",
line = dict(width =1)),
xaxis = "x2",yaxis = "y2"
)
layout = go.Layout(dict(title = "Scores for Categorical & Numerical features",
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
tickfont = dict(size =10),
domain=[0, 0.7],
tickangle = 90,zerolinewidth=1,
ticklen=5,gridwidth=2),
yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
title = "scores",
zerolinewidth=1,ticklen=5,gridwidth=2),
margin = dict(b=200),
xaxis2=dict(domain=[0.8, 1],tickangle = 90,
gridcolor = 'rgb(255, 255, 255)'),
yaxis2=dict(anchor='x2',gridcolor = 'rgb(255, 255, 255)')
)
)
data=[trace,trace1]
fig = go.Figure(data=data,layout=layout)
py.iplot(fig)
scores
[2.54297062e-01 1.33482766e+02 8.18577694e+01 1.31271509e+02
 9.29483891e-02 1.47165601e+02 3.12098318e+01 2.02160070e+01
 1.35439602e+02 1.73206148e+01 1.59306111e+01 1.04979224e+02
 3.88864216e+00 8.68247305e-01 6.51465136e+00 7.11376111e+01
 3.72082851e+02 2.85475152e+02 5.16714004e+02 1.76608724e+02
 4.86223101e+02 7.66190658e+01 9.99725387e+01 4.24113152e+02
 4.47251434e+01 1.63773281e+04 3.65307468e+03 6.29630810e+05]
P - Values
[6.14065505e-001 7.08954608e-031 1.46240915e-019 2.15953960e-030
 7.60461827e-001 7.21988253e-034 2.31590182e-008 6.91717063e-006
 2.64595220e-031 3.15742928e-005 6.57073922e-005 1.23423173e-024
 4.86137123e-002 3.51440986e-001 1.06989295e-002 3.33158163e-017
 6.58713045e-083 4.81399951e-064 2.19511926e-114 2.66631661e-040
 9.45428638e-108 2.07328356e-018 1.54524820e-023 3.10584857e-094
 2.26727030e-011 0.00000000e+000 0.00000000e+000 0.00000000e+000]
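The three largest scores above belong to the last three columns (tenure, MonthlyCharges, TotalCharges), which are the unscaled numeric features. One reading, offered here as an interpretation rather than something the notebook states, is that scikit-learn's `chi2` treats each feature as a non-negative count, so its statistic grows with the feature's units. The sketch below uses invented data to show that multiplying a feature by 100 multiplies its score by 100.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Invented data: a non-negative "months" feature loosely related to the target
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
months = rng.integers(1, 73, size=500) + 10 * (1 - y)

# Same information twice, the second copy in units 100 times larger
X = np.column_stack([months, months * 100])
scores, p_values = chi2(X, y)

print(scores)                 # second score is 100 times the first
print(scores[1] / scores[0])
```

Because of this, chi-square scores are comparable among the 0/1 dummy columns but not between those columns and features measured in months or currency.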
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from IPython.display import SVG,display
#top 3 categorical features
features_cat = score[score["feature_type"] == "Categorical"]["features"][:3].tolist()
#top 3 numerical features
features_num = score[score["feature_type"] == "Numerical"]["features"][:3].tolist()
#Function attributes
#columns - selected columns
#maximum_depth - depth of tree
#criterion_type - ["gini" or "entropy"]
#split_type - ["best" or "random"]
#model_performance - if True, prints and plots the model performance
def plot_decision_tree(columns,maximum_depth,criterion_type,
                       split_type,model_performance = None) :
    #separating dependent and independent variables
    #(df_x holds all rows with unscaled values; test_X holds scaled values)
    dtc_x = df_x[columns]
    dtc_y = df_y[target_col]
    #model
    dt_classifier = DecisionTreeClassifier(max_depth = maximum_depth,
                                           splitter = split_type,
                                           criterion = criterion_type,
                                          )
    dt_classifier.fit(dtc_x,dtc_y)
    #model performance
    if model_performance :
        telecom_churn_prediction(dt_classifier,
                                 dtc_x,test_X[columns],
                                 dtc_y,test_Y,
                                 columns,"features",threshold_plot = True)
    #display(graph)
plot_decision_tree(features_num,3,"gini","best")
plot_decision_tree(features_cat,3,"entropy","best",
                   model_performance = True ,)
Area under curve : 0.6773699091703116
DecisionTreeClassifier(criterion='entropy', max_depth=3)
Classification report :
precision recall f1-score support
0 0.83 0.84 0.83 1549
1 0.53 0.52 0.53 561
accuracy 0.75 2110
macro avg 0.68 0.68 0.68 2110
weighted avg 0.75 0.75 0.75 2110
Accuracy Score : 0.7516587677725118
#using contract,tenure and paperless billing variables
columns = ['tenure','Contract_Month-to-month', 'PaperlessBilling',
'Contract_One year', 'Contract_Two year']
plot_decision_tree(columns,3,"gini","best",model_performance= True)
Area under curve : 0.6979933002604177
DecisionTreeClassifier(max_depth=3)
Classification report :
precision recall f1-score support
0 0.86 0.73 0.79 1549
1 0.47 0.66 0.55 561
accuracy 0.72 2110
macro avg 0.67 0.70 0.67 2110
weighted avg 0.76 0.72 0.73 2110
Accuracy Score : 0.7151658767772512
def telecom_churn_prediction_alg(algorithm,training_x,testing_x,
                                 training_y,testing_y,threshold_plot = True) :
    #model
    algorithm.fit(training_x,training_y)
    predictions = algorithm.predict(testing_x)
    probabilities = algorithm.predict_proba(testing_x)
    print (algorithm)
    print ("\n Classification report : \n",classification_report(testing_y,predictions))
    print ("Accuracy Score : ",accuracy_score(testing_y,predictions))
    #confusion matrix
    conf_matrix = confusion_matrix(testing_y,predictions)
    #roc_auc_score
    model_roc_auc = roc_auc_score(testing_y,predictions)
    print ("Area under curve : ",model_roc_auc)
    fpr,tpr,thresholds = roc_curve(testing_y,probabilities[:,1])
    #plot roc curve
    trace1 = go.Scatter(x = fpr,y = tpr,
                        name = "Roc : " + str(model_roc_auc),
                        line = dict(color = ('rgb(22, 96, 167)'),width = 2),
                       )
    trace2 = go.Scatter(x = [0,1],y=[0,1],
                        line = dict(color = ('rgb(205, 12, 24)'),width = 2,
                                    dash = 'dot'))
    #plot confusion matrix
    trace3 = go.Heatmap(z = conf_matrix ,x = ["Not churn","Churn"],
                        y = ["Not churn","Churn"],
                        showscale = False,colorscale = "Blues",name = "matrix",
                        xaxis = "x2",yaxis = "y2"
                       )
    layout = go.Layout(dict(title="Model performance" ,
                            autosize = False,height = 500,width = 800,
                            showlegend = False,
                            plot_bgcolor = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(title = "false positive rate",
                                         gridcolor = 'rgb(255, 255, 255)',
                                         domain=[0, 0.6],
                                         ticklen=5,gridwidth=2),
                            yaxis = dict(title = "true positive rate",
                                         gridcolor = 'rgb(255, 255, 255)',
                                         zerolinewidth=1,
                                         ticklen=5,gridwidth=2),
                            margin = dict(b=200),
                            xaxis2=dict(domain=[0.7, 1],tickangle = 90,
                                        gridcolor = 'rgb(255, 255, 255)'),
                            yaxis2=dict(anchor='x2',gridcolor = 'rgb(255, 255, 255)')
                           )
                      )
    data = [trace1,trace2,trace3]
    fig = go.Figure(data=data,layout=layout)
    py.iplot(fig)
    if threshold_plot == True :
        visualizer = DiscriminationThreshold(algorithm)
        visualizer.fit(training_x,training_y)
        visualizer.poof()
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2,
weights='uniform')
telecom_churn_prediction_alg(knn,train_X,test_X,train_Y,test_Y,threshold_plot = True)
#telecom_churn_prediction_alg(knn,os_smote_X,test_X,
#os_smote_Y,test_Y,threshold_plot = True)
KNeighborsClassifier(n_jobs=1)
Classification report :
precision recall f1-score support
0 0.84 0.84 0.84 1549
1 0.55 0.55 0.55 561
accuracy 0.76 2110
macro avg 0.69 0.69 0.69 2110
weighted avg 0.76 0.76 0.76 2110
Accuracy Score : 0.7611374407582938
Area under curve : 0.6940582677110987
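The "Area under curve" value printed by these helper functions comes from `roc_auc_score(testing_y, predictions)`, that is, from hard 0/1 predictions. With hard labels the area reduces to the mean of the two per-class recalls, which is consistent with the report above: (0.84 + 0.55) / 2 ≈ 0.69. The sketch below illustrates this on hypothetical toy data (not taken from the telecom set), using a plain pairwise-ranking definition of AUC rather than any library call.

```python
def auc_from_scores(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs in which the
    positive is ranked higher; ties count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.1, 0.2, 0.3, 0.6, 0.4, 0.55, 0.7, 0.9]
labels = [1 if p >= 0.5 else 0 for p in proba]  # hard predictions at 0.5

n_pos = sum(y_true)
n_neg = len(y_true) - n_pos
tpr = sum(1 for y, l in zip(y_true, labels) if y == 1 and l == 1) / n_pos
tnr = sum(1 for y, l in zip(y_true, labels) if y == 0 and l == 0) / n_neg

auc_proba = auc_from_scores(y_true, proba)    # uses the full ranking
auc_labels = auc_from_scores(y_true, labels)  # uses only the 0/1 labels

print("AUC from probabilities:", auc_proba)
print("AUC from hard labels  :", auc_labels)
print("mean of TPR and TNR   :", (tpr + tnr) / 2)
```

The label-based value depends on the chosen threshold, while the probability-based value summarises the whole ROC curve, so the two numbers should not be compared directly.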
from sklearn.ensemble import RandomForestClassifier
#function attributes
#columns - column used
#nf_estimators - The number of trees in the forest.
#estimated_tree - tree number to be displayed
#maximum_depth - depth of the tree
#criterion_type - split criterion type ["gini" or "entropy"]
#Model performance - prints performance of model
def plot_tree_randomforest(columns,nf_estimators,
                           estimated_tree,maximum_depth,
                           criterion_type,model_performance = None) :
    dataframe = df_telcom_og[columns + target_col].copy()
    #train and test datasets
    rf_x = dataframe[[i for i in columns if i not in target_col]]
    rf_y = dataframe[target_col]
    #random forest classifier
    rfc = RandomForestClassifier(n_estimators = nf_estimators,
                                 max_depth = maximum_depth,
                                 criterion = criterion_type,
                                 )
    rfc.fit(rf_x,rf_y)
    estimated_tree = rfc.estimators_[estimated_tree]
    #model performance
    if model_performance == True :
        telecom_churn_prediction(rfc,
                                 rf_x,test_X[columns],
                                 rf_y,test_Y,
                                 columns,"features",threshold_plot = True)
cols1 = [ i for i in train_X.columns if i not in target_col + Id_col]
plot_tree_randomforest(cols1,100,99,3,"entropy",True)
Area under curve : 0.6558356895196602
RandomForestClassifier(criterion='entropy', max_depth=3)
Classification report :
precision recall f1-score support
0 0.81 0.91 0.85 1549
1 0.61 0.40 0.49 561
accuracy 0.77 2110
macro avg 0.71 0.66 0.67 2110
weighted avg 0.76 0.77 0.76 2110
Accuracy Score : 0.7734597156398104
#Making 10 trees with Random Forest.
n = np.arange(0,10).tolist()
cols1 = [ i for i in train_X.columns if i not in target_col + Id_col]
for i in n :
    plot_tree_randomforest(cols1,10,i,3,"entropy",model_performance=False)
#making 10 trees with random forest for columns
#selected from recursive feature elimination
n = np.arange(0,10).tolist()
cols = idc_rfe[idc_rfe["rfe_support"] == True]["columns"].tolist()
for i in n :
    plot_tree_randomforest(cols,10,i,3,"gini",model_performance=False)
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB(priors=None)
#telecom_churn_prediction_alg(gnb,os_smote_X,test_X,os_smote_Y,test_Y)
telecom_churn_prediction_alg(gnb,train_X,test_X,train_Y,test_Y)
GaussianNB()
Classification report :
precision recall f1-score support
0 0.90 0.72 0.80 1549
1 0.50 0.77 0.60 561
accuracy 0.73 2110
macro avg 0.70 0.74 0.70 2110
weighted avg 0.79 0.73 0.75 2110
Accuracy Score : 0.7331753554502369
Area under curve : 0.7443678803759313
from sklearn.svm import SVC
#Support vector classifier
#using a linear hyperplane
svc_lin = SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma=1.0, kernel='linear',
max_iter=-1, probability=True, random_state=None, shrinking=True,
tol=0.001, verbose=False)
cols = [i for i in telcom.columns if i not in Id_col + target_col]
telecom_churn_prediction(svc_lin,train_X,test_X,train_Y,test_Y,
cols,"coefficients",threshold_plot = False)
#telecom_churn_prediction(svc_lin,os_smote_X,test_X,os_smote_Y,test_Y,
#cols,"coefficients",threshold_plot = False)
Area under curve : 0.7172075826046129
SVC(gamma=1.0, kernel='linear', probability=True)
Classification report :
precision recall f1-score support
0 0.84 0.89 0.87 1549
1 0.64 0.55 0.59 561
accuracy 0.80 2110
macro avg 0.74 0.72 0.73 2110
weighted avg 0.79 0.80 0.79 2110
Accuracy Score : 0.7976303317535545
#tuning parameters
#Support vector classifier
#using a non-linear kernel ("rbf")
svc_rbf = SVC(C=1.0, kernel='rbf',
degree= 3, gamma=1.0,
coef0=0.0, shrinking=True,
probability=True,tol=0.001,
cache_size=200, class_weight=None,
verbose=False,max_iter= -1,
random_state=None)
telecom_churn_prediction_alg(svc_rbf,os_smote_X,test_X,os_smote_Y,test_Y,threshold_plot = False)
SVC(gamma=1.0, probability=True)
Classification report :
precision recall f1-score support
0 0.94 0.88 0.91 1549
1 0.73 0.85 0.78 561
accuracy 0.87 2110
macro avg 0.83 0.87 0.85 2110
weighted avg 0.88 0.87 0.88 2110
Accuracy Score : 0.8739336492890996
Area under curve : 0.8652491573541207
from lightgbm import LGBMClassifier
lgbm_c = LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
learning_rate=0.5, max_depth=7, min_child_samples=20,
min_child_weight=0.001, min_split_gain=0.0, n_estimators=100,
n_jobs=-1, num_leaves=500, objective='binary', random_state=None,
reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
subsample_for_bin=200000, subsample_freq=0)
cols = [i for i in telcom.columns if i not in Id_col + target_col]
#telecom_churn_prediction(lgbm_c,os_smote_X,test_X,os_smote_Y,test_Y,
#cols,"features",threshold_plot = True)
telecom_churn_prediction(lgbm_c,train_X,test_X,train_Y,test_Y,
cols,"features",threshold_plot = True)
Area under curve : 0.6811375057681973
LGBMClassifier(learning_rate=0.5, max_depth=7, num_leaves=500,
objective='binary')
Classification report :
precision recall f1-score support
0 0.83 0.86 0.84 1549
1 0.56 0.51 0.53 561
accuracy 0.76 2110
macro avg 0.69 0.68 0.69 2110
weighted avg 0.76 0.76 0.76 2110
Accuracy Score : 0.7630331753554502
from xgboost import XGBClassifier
xgc = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.9, max_delta_step=0,
max_depth = 7, min_child_weight=1, missing=1, n_estimators=100,
n_jobs=1, nthread=None, objective='binary:logistic', random_state=42,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
silent=True, subsample=1,verbosity=0)
telecom_churn_prediction(xgc,train_X,test_X,train_Y,test_Y,
cols,"features",threshold_plot = True)
#telecom_churn_prediction(xgc,os_smote_X,test_X,os_smote_Y,test_Y,
#cols,"features",threshold_plot = True)
Area under curve : 0.684688183624879
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.9, max_delta_step=0, max_depth=7,
min_child_weight=1, missing=1, monotone_constraints='()',
n_estimators=100, n_jobs=1, nthread=1, num_parallel_tree=1,
random_state=42, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
seed=42, silent=True, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=0)
Classification report :
precision recall f1-score support
0 0.83 0.86 0.85 1549
1 0.57 0.51 0.54 561
accuracy 0.77 2110
macro avg 0.70 0.68 0.69 2110
weighted avg 0.76 0.77 0.76 2110
Accuracy Score : 0.7682464454976303
from xgboost import plot_importance
fig, ax = plt.subplots(figsize=(15,12))
plot_importance(xgc, ax=ax)
<AxesSubplot:title={'center':'Feature importance'}, xlabel='F score', ylabel='Features'>
from sklearn.ensemble import AdaBoostClassifier
#ada = AdaBoostClassifier(random_state=124)
# TODO: Initialize the classifier
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())
telecom_churn_prediction(ada,train_X,test_X,train_Y,test_Y,
cols,"features",threshold_plot = True)
#telecom_churn_prediction(ada,os_smote_X,test_X,os_smote_Y,test_Y,
#cols,"features",threshold_plot = True)
Area under curve : 0.6319924648068043
AdaBoostClassifier(base_estimator=DecisionTreeClassifier())
Classification report :
precision recall f1-score support
0 0.80 0.84 0.82 1549
1 0.49 0.43 0.45 561
accuracy 0.73 2110
macro avg 0.64 0.63 0.64 2110
weighted avg 0.72 0.73 0.72 2110
Accuracy Score : 0.728436018957346
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(random_state=111)
#telecom_churn_prediction(gbc,os_smote_X,test_X,os_smote_Y,test_Y,cols,"features",threshold_plot = True)
telecom_churn_prediction(gbc, train_X,test_X,train_Y,test_Y,cols,"features", threshold_plot=True)
Area under curve : 0.7015503073111397
GradientBoostingClassifier(random_state=111)
Classification report :
precision recall f1-score support
0 0.83 0.90 0.86 1549
1 0.64 0.50 0.57 561
accuracy 0.79 2110
macro avg 0.74 0.70 0.72 2110
weighted avg 0.78 0.79 0.79 2110
Accuracy Score : 0.7938388625592417
from sklearn.ensemble import BaggingClassifier
bgc = BaggingClassifier(random_state=42)
telecom_churn_prediction(bgc,train_X,test_X,train_Y,test_Y,cols,"None",threshold_plot = True)
#telecom_churn_prediction(bgc,os_smote_X,test_X,os_smote_Y,test_Y,cols,"None",threshold_plot = True)
Area under curve : 0.663794363334864
BaggingClassifier(random_state=42)
Classification report :
precision recall f1-score support
0 0.81 0.90 0.85 1549
1 0.60 0.43 0.50 561
accuracy 0.77 2110
macro avg 0.71 0.66 0.68 2110
weighted avg 0.76 0.77 0.76 2110
Accuracy Score : 0.7734597156398104
def telecom_churn_prediction_cat(algorithm,training_x,testing_x,
                                 training_y,testing_y,threshold_plot = False) :
    #model
    algorithm.fit(training_x, training_y,
                  eval_set=(training_x, training_y),
                  verbose=False)
    predictions = algorithm.predict(testing_x)
    probabilities = algorithm.predict_proba(testing_x)
    print (algorithm)
    print ("\n CatBoost Classification report : \n",classification_report(testing_y,predictions))
    print ("CatBoost Accuracy Score : ",accuracy_score(testing_y,predictions))
    #confusion matrix
    conf_matrix = confusion_matrix(testing_y,predictions)
    #roc_auc_score (computed from the hard class predictions, not the probabilities)
    model_roc_auc = roc_auc_score(testing_y,predictions)
    print ("Area under curve : ",model_roc_auc)
    fpr,tpr,thresholds = roc_curve(testing_y,probabilities[:,1])
    #plot roc curve
    trace1 = go.Scatter(x = fpr,y = tpr,
                        name = "Roc : " + str(model_roc_auc),
                        line = dict(color = ('rgb(22, 96, 167)'),width = 2),
                        )
    trace2 = go.Scatter(x = [0,1],y=[0,1],
                        line = dict(color = ('rgb(205, 12, 24)'),width = 2,
                                    dash = 'dot'))
    #plot confusion matrix
    trace3 = go.Heatmap(z = conf_matrix ,x = ["Not churn","Churn"],
                        y = ["Not churn","Churn"],
                        showscale = False,colorscale = "Blues",name = "matrix",
                        xaxis = "x2",yaxis = "y2"
                        )
    layout = go.Layout(dict(title="Model performance" ,
                            autosize = False,height = 500,width = 800,
                            showlegend = False,
                            plot_bgcolor = "rgb(243,243,243)",
                            paper_bgcolor = "rgb(243,243,243)",
                            xaxis = dict(title = "false positive rate",
                                         gridcolor = 'rgb(255, 255, 255)',
                                         domain=[0, 0.6],
                                         ticklen=5,gridwidth=2),
                            yaxis = dict(title = "true positive rate",
                                         gridcolor = 'rgb(255, 255, 255)',
                                         zerolinewidth=1,
                                         ticklen=5,gridwidth=2),
                            margin = dict(b=200),
                            xaxis2=dict(domain=[0.7, 1],tickangle = 90,
                                        gridcolor = 'rgb(255, 255, 255)'),
                            yaxis2=dict(anchor='x2',gridcolor = 'rgb(255, 255, 255)')
                            )
                       )
    data = [trace1,trace2,trace3]
    fig = go.Figure(data=data,layout=layout)
    py.iplot(fig)
    if threshold_plot == True :
        visualizer = DiscriminationThreshold(algorithm)
        visualizer.fit(training_x,training_y)
        visualizer.poof()
from catboost import CatBoostClassifier
catboost_clf = CatBoostClassifier(cat_features=categorical_columns,
l2_leaf_reg=120, depth=6,
auto_class_weights='Balanced',
iterations=200, learning_rate=0.16,
use_best_model=True,
early_stopping_rounds=150,
eval_metric='F1', random_state=0)
telecom_churn_prediction_cat(catboost_clf,train_X,test_X,train_Y,test_Y)
#telecom_churn_prediction_cat(catboost_clf,os_smote_X,test_X,os_smote_Y,test_Y)
<catboost.core.CatBoostClassifier object at 0x131cb9dc0>
CatBoost Classification report :
precision recall f1-score support
0 0.90 0.74 0.81 1549
1 0.52 0.78 0.62 561
accuracy 0.75 2110
macro avg 0.71 0.76 0.72 2110
weighted avg 0.80 0.75 0.76 2110
CatBoost Accuracy Score : 0.7488151658767772
Area under curve : 0.758430774152492
For performance assessment of the chosen models, various metrics are used:
1. Feature weights: Indicates the top features used by the model to generate the predictions
2. Confusion matrix: Shows a grid of true and false predictions compared to the actual values
3. Accuracy score: Shows the overall accuracy of the model for the training set and the test set
4. ROC Curve: Shows the diagnostic ability of a model by bringing together true positive rate (TPR) and false positive rate (FPR) for different thresholds of class predictions (e.g. thresholds of 10%, 50% or 90% resulting in a prediction of churn)
5. AUC (for ROC): Measures the overall separability between classes of the model related to the ROC curve
6. Precision-Recall Curve: Shows the diagnostic ability by plotting precision against recall for different thresholds of class predictions. It is suitable for data sets with high class imbalance (negative values overrepresented) because precision and recall do not depend on the number of true negatives, so the overrepresented negative class does not dominate the result
7. F1 Score: The harmonic mean of precision and recall; it measures the compromise between the two
8. AUC (for PRC): Measures the overall separability between classes of the model related to the Precision-Recall curve
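Several of the metrics listed above follow directly from the four cells of the confusion matrix. The following is a minimal sketch on hypothetical toy labels (not the telecom data); the helper names `confusion_counts` and `summarize` are illustrative. It also includes Cohen's kappa, which the model report further below collects.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means churn."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tn, fp, fn, tp

def summarize(tn, fp, fn, tp):
    """Derive the headline metrics from the confusion-matrix cells."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)   # share of predicted churners who churned
    recall = tp / (tp + fn)      # share of actual churners that were found
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / total ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}

# Hypothetical toy labels: 4 churners, 6 non-churners
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = summarize(*confusion_counts(y_true, y_pred))
for name, value in metrics.items():
    print(f"{name:9s}: {value:.4f}")
```

Note that precision, recall and F1 never touch the true-negative count, which is why they stay informative when non-churners dominate the data.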
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
#gives model report in dataframe
def model_report(model,training_x,testing_x,training_y,testing_y,name) :
    model.fit(training_x,training_y)
    predictions = model.predict(testing_x)
    accuracy = accuracy_score(testing_y,predictions)
    recallscore = recall_score(testing_y,predictions)
    precision = precision_score(testing_y,predictions)
    roc_auc = roc_auc_score(testing_y,predictions)
    f1score = f1_score(testing_y,predictions)
    kappa_metric = cohen_kappa_score(testing_y,predictions)
    df = pd.DataFrame({"Model" : [name],
                       "Accuracy_score" : [accuracy],
                       "Recall_score" : [recallscore],
                       "Precision" : [precision],
                       "f1_score" : [f1score],
                       "Area_under_curve": [roc_auc],
                       "Kappa_metric" : [kappa_metric],
                       })
    df.sort_values(by=['Accuracy_score'], inplace=True)
    return df
#outputs for every model
model1 = model_report(logit,train_X,test_X,train_Y,test_Y,
"Logit Regression(BM)")
model2 = model_report(logit_smote,os_smote_X,test_X,os_smote_Y,test_Y,
"Logit Regression(SM)")
model3 = model_report(logit_rfe,train_rf_X,test_rf_X,train_rf_Y,test_rf_Y,
"Logit Regression(RFE)")
decision_tree = DecisionTreeClassifier(max_depth = 9,
random_state = 42,
splitter = "best",
criterion = "gini",
)
model4 = model_report(decision_tree,train_X,test_X,train_Y,test_Y,
"Decision Tree")
model5 = model_report(knn,train_X,test_X,train_Y,test_Y,
"KNN Classifier")
rfc = RandomForestClassifier(n_estimators = 1000,
random_state = 123,
max_depth = 9,
criterion = "gini")
model6 = model_report(rfc,train_X,test_X,train_Y,test_Y,
"Ran Forest Classifier")
model7 = model_report(gnb,train_X,test_X,train_Y,test_Y,
"Naive Bayes")
model8 = model_report(svc_lin,train_X,test_X,train_Y,test_Y,
"SVM Classifier Lin")
model9 = model_report(svc_rbf,train_X,test_X,train_Y,test_Y,
"SVM Classifier RBF")
model10 = model_report(lgbm_c,train_X,test_X,train_Y,test_Y,
"LGBM Classifier")
model11 = model_report(xgc,train_X,test_X,train_Y,test_Y,
"XGBoost Classifier")
model12 = model_report(gbc,train_X,test_X,train_Y,test_Y,
"Grad Boost Classifier")
model13 = model_report(ada,train_X,test_X,train_Y,test_Y,
"AdaBoost Classifier")
model14 = model_report(bgc,train_X,test_X,train_Y,test_Y,
"Bagging Classifier")
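#NOTE: bgc is passed here, so this row repeats the Bagging Classifier
#results under the "CatBoost Classifier" label
model15 = model_report(bgc,train_X,test_X,train_Y,test_Y,
"CatBoost Classifier")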
#concat all models
model_performances = pd.concat([model1,model2,model3,
model4,model5,model6,
model7,model8,model9,
model10,model11,model12,model13,model14,model15],axis = 0).reset_index()
model_performances = model_performances.drop(columns = "index",axis =1)
model_performances.sort_values(by=['Accuracy_score'],ascending= [False], inplace=True)
table = ff.create_table(np.round(model_performances,4))
py.iplot(table)
model_performances
def output_tracer(metric,color) :
    tracer = go.Bar(y = model_performances["Model"] ,
                    x = model_performances[metric],
                    orientation = "h",name = metric ,
                    marker = dict(line = dict(width =.7),
                                  color = color)
                    )
    return tracer
layout = go.Layout(dict(title = "Model performances",
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(243,243,243)",
xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
title = "metric",
zerolinewidth=1,
ticklen=5,gridwidth=2),
yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
zerolinewidth=1,ticklen=5,gridwidth=2),
margin = dict(l = 250),
height = 780
)
)
trace1 = output_tracer("Accuracy_score","#6699FF")
trace2 = output_tracer('Recall_score',"red")
trace3 = output_tracer('Precision',"#33CC99")
trace4 = output_tracer('f1_score',"lightgrey")
trace5 = output_tracer('Kappa_metric',"#FFCC99")
data = [trace1,trace2,trace3,trace4,trace5]
fig = go.Figure(data=data,layout=layout)
py.iplot(fig)
lst = [logit,logit_smote,decision_tree,knn,rfc,
gnb,svc_lin,svc_rbf,lgbm_c,xgc,gbc,ada,bgc,catboost_clf]
length = len(lst)
mods = ['Logistic Regression(Baseline_model)','Logistic Regression(SMOTE)',
'Decision Tree','KNN Classifier','Random Forest Classifier',"Naive Bayes",
'SVM Classifier Linear','SVM Classifier RBF', 'LGBM Classifier',
'XGBoost Classifier','Gradient Boosting Classifier','AdaBoost Classifier','Bagging Classifier','CatBoost Classifier']
fig = plt.figure(figsize=(18,15))
fig.set_facecolor("#F3F3F3")
for i,j,k in itertools.zip_longest(lst,range(length),mods) :
    plt.subplot(4,5,j+1)
    predictions = i.predict(test_X)
    #confusion_matrix expects (y_true, y_pred): rows are actual, columns are predicted
    conf_matrix = confusion_matrix(test_Y,predictions)
    sns.heatmap(conf_matrix,annot=True,fmt = "d",square = True,
                xticklabels=["not churn","churn"],
                yticklabels=["not churn","churn"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
    plt.title(k,color = "b")
    plt.subplots_adjust(wspace = .3,hspace = .3)
lst = [logit,logit_smote,decision_tree,knn,rfc,
gnb,svc_lin,svc_rbf,lgbm_c,xgc,gbc,ada,bgc,catboost_clf]
length = len(lst)
mods = ['Logistic Regression(Baseline_model)','Logistic Regression(SMOTE)',
'Decision Tree','KNN Classifier','Random Forest Classifier',"Naive Bayes",
'SVM Classifier Linear','SVM Classifier RBF', 'LGBM Classifier',
'XGBoost Classifier','Gradient Boost Classifier','AdaBoost Classifier','Bagging Classifier','CatBoost Classifier']
plt.style.use("dark_background")
fig = plt.figure(figsize=(15,16))
fig.set_facecolor("#F3F3F3")
for i,j,k in itertools.zip_longest(lst,range(length),mods) :
    qx = plt.subplot(4,5,j+1)
    probabilities = i.predict_proba(test_X)
    fpr,tpr,thresholds = roc_curve(test_Y,probabilities[:,1])
    #AUC computed from the predicted probabilities, matching the plotted curve
    plt.plot(fpr,tpr,linestyle = "dotted",
             color = "royalblue",linewidth = 2,
             label = "AUC = " + str(np.around(roc_auc_score(test_Y,probabilities[:,1]),3)))
    plt.plot([0,1],[0,1],linestyle = "dashed",
             color = "orangered",linewidth = 1.5)
    plt.fill_between(fpr,tpr,alpha = .4)
    plt.fill_between([0,1],[0,1],color = "k")
    plt.legend(loc = "lower right",
               prop = {"size" : 12})
    qx.set_facecolor("k")
    plt.grid(True,alpha = .15)
    plt.title(k,color = "b")
    plt.xticks(np.arange(0,1,.3))
    plt.yticks(np.arange(0,1,.3))
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
lst = [logit,logit_smote,decision_tree,knn,rfc,
gnb,svc_lin,svc_rbf,lgbm_c,xgc,gbc,ada,bgc,catboost_clf]
length = len(lst)
mods = ['Logistic Regression(Baseline_model)','Logistic Regression(SMOTE)',
'Decision Tree','KNN Classifier','Random Forest Classifier',"Naive Bayes",
'SVM Classifier Linear','SVM Classifier RBF', 'LGBM Classifier',
'XGBoost Classifier','Gradient Boost Classifier','AdaBoost Classifier','Bagging Classifier','CatBoost Classifier']
fig = plt.figure(figsize=(15,17))
fig.set_facecolor("#F3F3F3")
for i,j,k in itertools.zip_longest(lst,range(length),mods) :
    qx = plt.subplot(4,5,j+1)
    probabilities = i.predict_proba(test_X)
    #precision_recall_curve returns (precision, recall, thresholds), in that order
    precision,recall,thresholds = precision_recall_curve(test_Y,probabilities[:,1])
    plt.plot(recall,precision,linewidth = 1.5,
             label = ("avg_pcn : " +
                      str(np.around(average_precision_score(test_Y,probabilities[:,1]),3))))
    plt.plot([0,1],[0,0],linestyle = "dashed")
    plt.fill_between(recall,precision,alpha = .2)
    plt.legend(loc = "lower left",
               prop = {"size" : 10})
    qx.set_facecolor("k")
    plt.grid(True,alpha = .15)
    plt.title(k,color = "b")
    plt.xlabel("recall",fontsize =7)
    plt.ylabel("precision",fontsize =7)
    plt.xlim([0.25,1])
    plt.yticks(np.arange(0,1,.3))
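A precision-recall curve is built by sweeping the decision threshold over the predicted churn probabilities and recomputing precision and recall at each step. The sketch below does this in plain Python on hypothetical toy scores; the helper `precision_recall_points` is illustrative and returns its values in the same order as scikit-learn's `precision_recall_curve`, namely precision first, then recall, then the threshold.

```python
def precision_recall_points(y_true, scores):
    """Return (precision, recall, threshold) triples, one per distinct score,
    predicting churn whenever score >= threshold."""
    points = []
    for thr in sorted(set(scores)):
        pred = [1 if s >= thr else 0 for s in scores]
        tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
        fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
        precision = tp / (tp + fp)  # tp + fp >= 1: the sample at thr is predicted positive
        recall = tp / (tp + fn)
        points.append((precision, recall, thr))
    return points

# Hypothetical toy data: 4 non-churners followed by 4 churners
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.6, 0.4, 0.55, 0.7, 0.9]

points = precision_recall_points(y_true, scores)
for precision, recall, thr in points:
    print(f"threshold {thr:.2f}: precision {precision:.2f}, recall {recall:.2f}")
```

Raising the threshold can only keep recall the same or lower it, whereas precision typically rises but may fluctuate; plotting recall on the x-axis and precision on the y-axis gives the curve.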
In this section, we try to improve the performance of our models. We first focus on the following techniques:
Cross-validation is a technique that consists of dividing the data into multiple folds (k) and, at each iteration, using k-1 folds for training and the remaining fold for validation. This helps to detect overfitting (the model does not generalize properly to unseen data) and helps us choose the best model. The general term is k-fold cross-validation, where k is the number of folds the training data is split into.
Hyperparameter tuning consists of feeding our model a range of parameters and keeping the combination that gives the best validation score.
To address a potential bias stemming from the specific split of the data in the train-test split, cross-validation is used during hyperparameter tuning with Grid Search and Randomized Search. Cross-validation splits the training data into a specified number of folds. In each iteration, one fold is held out as the “training-dev” set and the other folds are used as the training set. The result of k-fold cross-validation is k values for each metric.
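The mechanics described above can be sketched without any library: split the sample indices into k folds, enumerate every parameter combination, and count how many model fits a grid search needs. The helper names `k_fold_indices` and `grid_candidates` and the small grid are hypothetical and serve only as an illustration; the real search is delegated to scikit-learn.

```python
from itertools import product

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def grid_candidates(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    keys = sorted(param_grid)
    for values in product(*(param_grid[key] for key in keys)):
        yield dict(zip(keys, values))

# Illustrative sizes: 10 samples, 5 folds, a 3 x 2 grid (assumed values)
folds = list(k_fold_indices(10, 5))
grid = {"C": [1.0, 10.0, 100.0], "penalty": ["l1", "l2"]}
candidates = list(grid_candidates(grid))

# Every candidate is trained once per fold
n_fits = len(candidates) * len(folds)
print("folds:", len(folds), "candidates:", len(candidates), "fits:", n_fits)
for train_idx, val_idx in folds:
    print("train", train_idx, "validate", val_idx)
```

Each sample lands in the validation fold exactly once, and the number of fits grows multiplicatively with every parameter added to the grid, which is why Randomized Search is attractive for larger grids.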
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
from sklearn.metrics import make_scorer
from sklearn.metrics import fbeta_score, accuracy_score
from sklearn.linear_model import LogisticRegression # Logistic Regression Classifier
import time
start_time = time.time()
#Logistic Regression Classifier
logit = LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False)
#Hyperparameters
parameters = {'C':np.logspace(0, 4, 10),
'penalty' : ['l1', 'l2']
}
# Make an fbeta_score scoring object
scorer = make_scorer(fbeta_score,beta=0.5)
# Perform grid search on the classifier using 'scorer' as the scoring method
logit_grid_obj = GridSearchCV(logit, parameters, scoring=scorer)
# Fit the grid search object to the training data and find the optimal parameters
logit_grid_fit = logit_grid_obj.fit(train_X,train_Y)
# Get the estimator
logit_best_clf = logit_grid_fit.best_estimator_
# View best hyperparameters
print(logit_grid_fit.best_params_)
# Make predictions using the unoptimized and the optimized model
logit_predictions = (logit.fit(train_X, train_Y)).predict(test_X)
logit_best_predictions = logit_best_clf.predict(test_X)
logit_best_predictions_tuned_prob = logit_grid_obj.predict_proba(test_X)
#before hypertuning
logit_before_accuracy=accuracy_score(test_Y, logit_predictions)
logit_before_f1_score=fbeta_score(test_Y, logit_predictions, beta = 0.5)
logit_before_recal = recall_score(test_Y,logit_predictions)
logit_before_precision = precision_score(test_Y,logit_predictions)
logit_before_roc_auc = roc_auc_score(test_Y,logit_predictions)
logit_before_kappa_metric = cohen_kappa_score(test_Y,logit_predictions)
#after hypertuning
logit_hypertuned_accuracy=accuracy_score(test_Y, logit_best_predictions)
logit_hypertuned_f1_score=fbeta_score(test_Y, logit_best_predictions, beta = 0.5)
logit_hypertuned_recal = recall_score(test_Y,logit_best_predictions)
logit_hypertuned_precision = precision_score(test_Y,logit_best_predictions)
logit_hypertuned_roc_auc = roc_auc_score(test_Y,logit_best_predictions)
logit_hypertuned_kappa_metric = cohen_kappa_score(test_Y,logit_best_predictions)
# Report the before-and-afterscores
print ("Un - Optimized Model\n------")
print ("Accuracy Score on Testing Data: {:.4f}".format(logit_before_accuracy))
print ("F-score on Testing Data: {:.4f}".format(logit_before_f1_score))
print ("Recall on Testing Data: {:.4f}".format(logit_before_recal))
print ("Precision on Testing Data: {:.4f}".format(logit_before_precision))
print ("ROC-AUC on Testing Data: {:.4f}".format(logit_before_roc_auc))
print ("Kappa-metric on Testing Data: {:.4f}".format(logit_before_kappa_metric))
print ("\nOptimized Model\n------")
print ("Hypertuned Accuracy Score on the Testing Data: {:.4f}".format(logit_hypertuned_accuracy))
print ("Hypertuned F-score on the Testing Data: {:.4f}".format(logit_hypertuned_f1_score))
print ("Hypertuned Recall on Testing Data: {:.4f}".format(logit_hypertuned_recal))
print ("Hypertuned Precision on Testing Data: {:.4f}".format(logit_hypertuned_precision))
print ("Hypertuned ROC-AUC on Testing Data: {:.4f}".format(logit_hypertuned_roc_auc))
print ("Hypertuned Kappa-metric on Testing Data: {:.4f}".format(logit_hypertuned_kappa_metric))
print (logit_best_clf)
print("--- %s seconds in execution---" % (time.time() - start_time))
{'C': 166.81005372000593, 'penalty': 'l2'}
Un - Optimized Model
------
Accuracy Score on Testing Data: 0.8028
F-score on Testing Data: 0.6298
Recall on Testing Data: 0.5597
Precision on Testing Data: 0.6501
ROC-AUC on Testing Data: 0.7253
Kappa-metric on Testing Data: 0.4715
Optimized Model
------
Hypertuned Accuracy Score on the Testing Data: 0.8024
Hypertuned F-score on the Testing Data: 0.6288
Hypertuned Recall on Testing Data: 0.5561
Hypertuned Precision on Testing Data: 0.6500
Hypertuned ROC-AUC on Testing Data: 0.7238
Hypertuned Kappa-metric on Testing Data: 0.4693
LogisticRegression(C=166.81005372000593, multi_class='ovr', n_jobs=1,
solver='liblinear')
--- 64.11209487915039 seconds in execution---
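The "F-score" lines in this output are F-beta scores with beta = 0.5, the same setting used for the grid-search scorer, so precision is weighted more heavily than recall. As a worked check (a sketch with a hypothetical helper `fbeta`, using the rounded precision and recall of the unoptimized model printed above), the formula F_beta = (1 + beta²) · P · R / (beta² · P + R) reproduces the reported 0.6298 up to rounding.

```python
def fbeta(precision, recall, beta):
    """F-beta score: beta < 1 favours precision, beta > 1 favours recall."""
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Rounded values printed for the unoptimized model
precision, recall = 0.6501, 0.5597

f_half = fbeta(precision, recall, beta=0.5)  # the scorer used for tuning
f_one = fbeta(precision, recall, beta=1.0)   # ordinary F1 for comparison

print("F0.5:", round(f_half, 4))
print("F1  :", round(f_one, 4))
```

Because precision is higher than recall here, F0.5 comes out above F1; a retention campaign that cares more about catching churners than about false alarms might prefer a beta above 1 instead.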
from sklearn.metrics import precision_recall_curve, auc, f1_score, plot_confusion_matrix, precision_score, recall_score
# Define a function that plots the confusion matrix for a classifier and the train and test accuracy
def confusion_matrix_plot(X_train, y_train, X_test, y_test, classifier, y_pred, classifier_name):
    fig, ax = plt.subplots(figsize=(7, 6))
    plot_confusion_matrix(classifier, X_test, y_test, display_labels=["No Churn", "Churn"], cmap=plt.cm.Blues,
                          normalize=None, ax=ax)
    ax.set_title(f'{classifier_name} - Confusion Matrix')
    plt.show()
    fig, ax = plt.subplots(figsize=(7, 6))
    plot_confusion_matrix(classifier, X_test, y_test, display_labels=["No Churn", "Churn"],
                          cmap=plt.cm.Blues, normalize='true', ax=ax)
    ax.set_title(f'{classifier_name} - Confusion Matrix (norm.)')
    plt.show()
    print(f'Accuracy Score Test: {accuracy_score(y_test, y_pred)}')
    print(f'Accuracy Score Train: {classifier.score(X_train, y_train)} (as comparison)')
    return print("")
# Define a function that plots the ROC curve and the AUC score
def roc_curve_auc_score(X_test, y_test, y_pred_probabilities, classifier_name):
    y_pred_prob = y_pred_probabilities[:,1]
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.plot(fpr, tpr, label=f'{classifier_name}')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title(f'{classifier_name} - ROC Curve')
    plt.show()
    return print(f'AUC Score (ROC): {roc_auc_score(y_test, y_pred_prob)}\n')
# Define a function that plots the precision-recall-curve and the F1 score and AUC score
def precision_recall_curve_and_scores(X_test, y_test, y_pred, y_pred_probabilities, classifier_name):
    y_pred_prob = y_pred_probabilities[:,1]
    precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
    plt.plot(recall, precision, label=f'{classifier_name}')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title(f'{classifier_name} - Precision-Recall Curve')
    plt.show()
    f1_score_result, auc_score_result = f1_score(y_test, y_pred), auc(recall, precision)
    return print(f'F1 Score: {f1_score_result} \nAUC Score (PR): {auc_score_result}\n')
# Plot model evaluations.
# Pass the tuned model's own predictions for the accuracy and F1 figures
confusion_matrix_plot(train_X, train_Y, test_X, test_Y, logit_grid_obj, logit_best_predictions, 'Log. Regression (Tuned)')
roc_curve_auc_score(test_X, test_Y, logit_best_predictions_tuned_prob, 'Log. Regression (Tuned)')
precision_recall_curve_and_scores(test_X, test_Y, logit_best_predictions, logit_best_predictions_tuned_prob, 'Log. Regression (Tuned)')
Accuracy Score Test: 0.7488151658767772
Accuracy Score Train: 0.6345885634588564 (as comparison)
AUC Score (ROC): 0.8372839011771149
F1 Score: 0.6225071225071225
AUC Score (PR): 0.6118133431147317
# TODO: Import 'GridSearchCV', 'make_scorer', and any other necessary libraries
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import make_scorer
from sklearn.metrics import fbeta_score, accuracy_score
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
import time
start_time = time.time()
# TODO: Initialize the classifier
ada = AdaBoostClassifier(base_estimator=DecisionTreeClassifier())
#ada = AdaBoostClassifier(random_state=124)
# TODO: Create the parameters list you wish to tune
ada_parameters = {'n_estimators':[50, 120],
# 'learning_rate':[0.1, 0.5, 1.],
'learning_rate':[0.5, 1.],#-> reduced grid to save execution time (learning_rate must be > 0)
'base_estimator__min_samples_split' : np.arange(2, 8, 2),
'base_estimator__max_depth' : np.arange(1, 4, 1)
}
# TODO: Make an fbeta_score scoring object
scorer = make_scorer(fbeta_score,beta=0.5)
# TODO: Perform grid search on the classifier using 'scorer' as the scoring method
ada_grid_obj = GridSearchCV(ada, ada_parameters, scoring=scorer)
# TODO: Fit the grid search object to the training data and find the optimal parameters
ada_grid_fit = ada_grid_obj.fit(os_smote_X,os_smote_Y)
# Get the estimator
ada_best_clf = ada_grid_fit.best_estimator_
# Make predictions using the unoptimized and the optimized model
ada_predictions = (ada.fit(train_X, train_Y)).predict(test_X)
ada_best_predictions_tuned_prob = ada_grid_obj.predict_proba(test_X)
ada_best_predictions = ada_best_clf.predict(test_X)
#before hypertuning
ada_before_accuracy=accuracy_score(test_Y, ada_predictions)
ada_before_f1_score=fbeta_score(test_Y, ada_predictions, beta = 0.5)
ada_before_recal = recall_score(test_Y,ada_predictions)
ada_before_precision = precision_score(test_Y,ada_predictions)
ada_before_roc_auc = roc_auc_score(test_Y,ada_predictions)
ada_before_kappa_metric = cohen_kappa_score(test_Y,ada_predictions)
#after hypertuning
ada_hypertuned_accuracy=accuracy_score(test_Y, ada_best_predictions)
ada_hypertuned_f1_score=fbeta_score(test_Y, ada_best_predictions, beta = 0.5)
ada_hypertuned_recal = recall_score(test_Y,ada_best_predictions)
ada_hypertuned_precision = precision_score(test_Y,ada_best_predictions)
ada_hypertuned_roc_auc = roc_auc_score(test_Y,ada_best_predictions)
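# Score the tuned predictions here; passing ada_predictions would only repeat the untuned kappa.
ada_hypertuned_kappa_metric = cohen_kappa_score(test_Y,ada_best_predictions)
# Report the before-and-after scores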
print ("Un - Optimized Model\n------")
print ("Accuracy Score on Testing Data: {:.4f}".format(ada_before_accuracy))
print ("F-score on Testing Data: {:.4f}".format(ada_before_f1_score))
print ("Recall on Testing Data: {:.4f}".format(ada_before_recal))
print ("Precision on Testing Data: {:.4f}".format(ada_before_precision))
print ("ROC-AUC on Testing Data: {:.4f}".format(ada_before_roc_auc))
print ("Kappa-metric on Testing Data: {:.4f}".format(ada_before_kappa_metric))
print ("\nOptimized Model\n------")
print ("Hypertuned Accuracy Score on the Testing Data: {:.4f}".format(ada_hypertuned_accuracy))
print ("Hypertuned F-score on the Testing Data: {:.4f}".format(ada_hypertuned_f1_score))
print ("Hypertuned Recall on Testing Data: {:.4f}".format(ada_hypertuned_recal))
print ("Hypertuned Precision on Testing Data: {:.4f}".format(ada_hypertuned_precision))
print ("Hypertuned ROC-AUC on Testing Data: {:.4f}".format(ada_hypertuned_roc_auc))
print ("Hypertuned Kappa-metric on Testing Data: {:.4f}".format(ada_hypertuned_kappa_metric))
print (ada_best_clf)
print("--- %s seconds in execution---" % (time.time() - start_time))
Un - Optimized Model
------
Accuracy Score on Testing Data: 0.7327
F-score on Testing Data: 0.4900
Recall on Testing Data: 0.4635
Precision on Testing Data: 0.4971
ROC-AUC on Testing Data: 0.6468
Kappa-metric on Testing Data: 0.3002
Optimized Model
------
Hypertuned Accuracy Score on the Testing Data: 0.7744
Hypertuned F-score on the Testing Data: 0.5862
Hypertuned Recall on Testing Data: 0.7433
Hypertuned Precision on Testing Data: 0.5567
Hypertuned ROC-AUC on Testing Data: 0.7645
Hypertuned Kappa-metric on Testing Data: 0.3002
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=2),
learning_rate=1)
--- 93.43132185935974 seconds in execution---
# Plot model evaluations, passing the tuned predictions so the plots match the '(Tuned)' title.
confusion_matrix_plot(os_smote_X, os_smote_Y, test_X, test_Y, ada_grid_obj, ada_best_predictions, 'Ada Boost Classifier (Tuned)')
roc_curve_auc_score(test_X, test_Y, ada_best_predictions_tuned_prob, 'Ada Boost Classifier (Tuned)')
precision_recall_curve_and_scores(test_X, test_Y, ada_best_predictions, ada_best_predictions_tuned_prob, 'Ada Boost Classifier (Tuned)')
Accuracy Score Test: 0.7327014218009479 Accuracy Score Train: 0.8222057368941643 (as comparison)
AUC Score (ROC): 0.8482765604627907
F1 Score: 0.47970479704797053 AUC Score (PR): 0.6565012209646841
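The scorer in the AdaBoost cell uses `fbeta_score` with `beta=0.5`, which weights precision more heavily than recall. As a minimal sketch of that formula (the helper name `f_beta` is ours, not a scikit-learn function), the tuned F-score in the log can be approximately reproduced from the logged precision and recall:

```python
def f_beta(precision, recall, beta):
    """F-beta from precision and recall; beta < 1 favours precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of the tuned AdaBoost model, copied from the log above.
tuned_precision, tuned_recall = 0.5567, 0.7433

f_half = f_beta(tuned_precision, tuned_recall, beta=0.5)
f_one = f_beta(tuned_precision, tuned_recall, beta=1.0)
print(f"F0.5 = {f_half:.4f}, F1 = {f_one:.4f}")
```

The F0.5 value comes out at about 0.586, in line with the logged 0.5862 up to rounding of the inputs. Because recall is higher than precision for this model, the precision-weighted F0.5 is lower than the balanced F1.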
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
import time
start_time = time.time()
# Define the parameter grid for RandomizedSearchCV, then instantiate and train the model.
param_grid_rf = {'n_estimators': np.arange(10, 2000, 10),
'max_features': ['auto', 'sqrt'],
'max_depth': np.arange(10, 200, 10),
'criterion': ['gini', 'entropy'],
'bootstrap': [True, False]}
rf = RandomForestClassifier(n_estimators = 100,
max_depth = 3,
criterion = "entropy"
)
rf_grid_obj = RandomizedSearchCV(estimator=rf, param_distributions=param_grid_rf, cv=5, verbose=0)
rf_grid_fit = rf_grid_obj.fit(train_X, train_Y)
# Get the estimator
rf_best_clf = rf_grid_fit.best_estimator_
# Make predictions (classes and probabilities) with the trained model on the test set.
# rf_predictions = rf_grid_obj.predict(test_X)
rf_predictions = (rf.fit(train_X, train_Y)).predict(test_X)
rf_best_predictions = rf_best_clf.predict(test_X)
rf_best_predictions_tuned_prob = rf_grid_obj.predict_proba(test_X)
#before hypertuning
rf_before_accuracy=accuracy_score(test_Y, rf_predictions)
rf_before_f1_score=fbeta_score(test_Y, rf_predictions, beta = 0.5)
rf_before_recal = recall_score(test_Y,rf_predictions)
rf_before_precision = precision_score(test_Y,rf_predictions)
rf_before_roc_auc = roc_auc_score(test_Y,rf_predictions)
rf_before_kappa_metric = cohen_kappa_score(test_Y,rf_predictions)
#after hypertuning
rf_hypertuned_accuracy=accuracy_score(test_Y, rf_best_predictions)
rf_hypertuned_f1_score=fbeta_score(test_Y, rf_best_predictions, beta = 0.5)
rf_hypertuned_recal = recall_score(test_Y,rf_best_predictions)
rf_hypertuned_precision = precision_score(test_Y,rf_best_predictions)
rf_hypertuned_roc_auc = roc_auc_score(test_Y,rf_best_predictions)
rf_hypertuned_kappa_metric = cohen_kappa_score(test_Y,rf_best_predictions)
# Report the before-and-after scores
print ("Un - Optimized Model\n------")
print ("Accuracy Score on Testing Data: {:.4f}".format(rf_before_accuracy))
print ("F-score on Testing Data: {:.4f}".format(rf_before_f1_score))
print ("Recall on Testing Data: {:.4f}".format(rf_before_recal))
print ("Precision on Testing Data: {:.4f}".format(rf_before_precision))
print ("ROC-AUC on Testing Data: {:.4f}".format(rf_before_roc_auc))
print ("Kappa-metric on Testing Data: {:.4f}".format(rf_before_kappa_metric))
print ("\nOptimized Model\n------")
print ("Hypertuned Accuracy Score on the Testing Data: {:.4f}".format(rf_hypertuned_accuracy))
print ("Hypertuned F-score on the Testing Data: {:.4f}".format(rf_hypertuned_f1_score))
print ("Hypertuned Recall on Testing Data: {:.4f}".format(rf_hypertuned_recal))
print ("Hypertuned Precision on Testing Data: {:.4f}".format(rf_hypertuned_precision))
print ("Hypertuned ROC-AUC on Testing Data: {:.4f}".format(rf_hypertuned_roc_auc))
print ("Hypertuned Kappa-metric on Testing Data: {:.4f}".format(rf_hypertuned_kappa_metric))
print (rf_best_clf)
print("--- %s seconds in execution---" % (time.time() - start_time))
####
Un - Optimized Model
------
Accuracy Score on Testing Data: 0.7773
F-score on Testing Data: 0.5407
Recall on Testing Data: 0.2816
Precision on Testing Data: 0.7022
ROC-AUC on Testing Data: 0.6192
Kappa-metric on Testing Data: 0.2947
Optimized Model
------
Hypertuned Accuracy Score on the Testing Data: 0.7957
Hypertuned F-score on the Testing Data: 0.6136
Hypertuned Recall on Testing Data: 0.4938
Hypertuned Precision on Testing Data: 0.6533
Hypertuned ROC-AUC on Testing Data: 0.6994
Hypertuned Kappa-metric on Testing Data: 0.4325
RandomForestClassifier(criterion='entropy', max_depth=10, max_features='sqrt',
n_estimators=1910)
--- 227.83290910720825 seconds in execution---
# Plot model evaluations. Pass rf_best_predictions; the bare `predictions` variable is left over from an earlier model.
confusion_matrix_plot(train_X, train_Y, test_X, test_Y, rf_grid_obj, rf_best_predictions, 'Random Forest (Tuned)')
roc_curve_auc_score(test_X, test_Y, rf_best_predictions_tuned_prob, 'Random Forest (Tuned)')
precision_recall_curve_and_scores(test_X, test_Y, rf_best_predictions, rf_best_predictions_tuned_prob, 'Random Forest (Tuned)')
Accuracy Score Test: 0.7488151658767772 Accuracy Score Train: 0.8738317757009346 (as comparison)
AUC Score (ROC): 0.8349179333685468
F1 Score: 0.6225071225071225 AUC Score (PR): 0.6369684111934477
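The metric cells pass hard class predictions to `roc_auc_score`, while the plotting helper receives probabilities. That is plausibly why the tuned Random Forest reports "ROC-AUC on Testing Data: 0.6994" but "AUC Score (ROC): 0.8349". The sketch below uses hypothetical toy data and a hand-written pairwise AUC (`pairwise_auc` is our name) to show that AUC computed from thresholded labels collapses to balanced accuracy:

```python
def pairwise_auc(y_true, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, not taken from the churn data set.
y_true = [0, 0, 0, 1, 1, 1]
probs = [0.10, 0.40, 0.60, 0.45, 0.70, 0.90]
labels = [int(p >= 0.5) for p in probs]  # hard predictions at a 0.5 threshold

auc_from_probs = pairwise_auc(y_true, probs)
auc_from_labels = pairwise_auc(y_true, labels)

# Balanced accuracy = mean of recall (true positive rate) and specificity (true negative rate).
n_pos = sum(y_true)
n_neg = len(y_true) - n_pos
recall = sum(1 for y, l in zip(y_true, labels) if y == 1 and l == 1) / n_pos
specificity = sum(1 for y, l in zip(y_true, labels) if y == 0 and l == 0) / n_neg
balanced_accuracy = (recall + specificity) / 2

print(auc_from_probs, auc_from_labels, balanced_accuracy)
```

With scikit-learn, the usual way to get the ranking-based value is to pass the positive-class column of `predict_proba(test_X)` to `roc_auc_score` instead of the output of `predict`.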
from sklearn.ensemble import GradientBoostingClassifier
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
import time
start_time = time.time()
gdb = GradientBoostingClassifier(random_state = 30)
## Set up hyperparameter grid for tuning
## Tune hyperparameters
#sgb_cv = RandomizedSearchCV(clf, param_distributions = sgb_param_grid, cv = 5,
#random_state = 20, n_iter = 20)
gb_parameters = {
"n_estimators":[5,50,250,500],
"max_depth":[1,3,5,7,9],
# "learning_rate":[0.01,0.1,1,10,100]
'learning_rate': [1, 1, 5]
}
gb_grid_obj = GridSearchCV(gdb, gb_parameters,cv=5)
# TODO: Fit the grid search object to the training data and find the optimal parameters
gb_grid_fit = gb_grid_obj.fit(os_smote_X,os_smote_Y)
# Get the estimator
gb_best_clf = gb_grid_fit.best_estimator_
# Make predictions using the unoptimized and the tuned model
gb_predictions = (gdb.fit(train_X, train_Y)).predict(test_X)
gb_best_predictions_tuned_prob = gb_grid_obj.predict_proba(test_X)
gb_best_predictions = gb_best_clf.predict(test_X)
#before hypertuning
gb_before_accuracy=accuracy_score(test_Y, gb_predictions)
gb_before_f1_score=fbeta_score(test_Y, gb_predictions, beta = 0.5)
gb_before_recal = recall_score(test_Y,gb_predictions)
gb_before_precision = precision_score(test_Y,gb_predictions)
gb_before_roc_auc = roc_auc_score(test_Y,gb_predictions)
gb_before_kappa_metric = cohen_kappa_score(test_Y,gb_predictions)
#after hypertuning
gb_hypertuned_accuracy=accuracy_score(test_Y, gb_best_predictions)
gb_hypertuned_f1_score=fbeta_score(test_Y, gb_best_predictions, beta = 0.5)
gb_hypertuned_recal = recall_score(test_Y,gb_best_predictions)
gb_hypertuned_precision = precision_score(test_Y,gb_best_predictions)
gb_hypertuned_roc_auc = roc_auc_score(test_Y,gb_best_predictions)
gb_hypertuned_kappa_metric = cohen_kappa_score(test_Y,gb_best_predictions)
# Report the before-and-after scores
print ("Un - Optimized Model\n------")
print ("Accuracy Score on Testing Data: {:.4f}".format(gb_before_accuracy))
print ("F-score on Testing Data: {:.4f}".format(gb_before_f1_score))
print ("Recall on Testing Data: {:.4f}".format(gb_before_recal))
print ("Precision on Testing Data: {:.4f}".format(gb_before_precision))
print ("ROC-AUC on Testing Data: {:.4f}".format(gb_before_roc_auc))
print ("Kappa-metric on Testing Data: {:.4f}".format(gb_before_kappa_metric))
print ("\nOptimized Model\n------")
print ("Hypertuned Accuracy Score on the Testing Data: {:.4f}".format(gb_hypertuned_accuracy))
print ("Hypertuned F-score on the Testing Data: {:.4f}".format(gb_hypertuned_f1_score))
print ("Hypertuned Recall on Testing Data: {:.4f}".format(gb_hypertuned_recal))
print ("Hypertuned Precision on Testing Data: {:.4f}".format(gb_hypertuned_precision))
print ("Hypertuned ROC-AUC on Testing Data: {:.4f}".format(gb_hypertuned_roc_auc))
print ("Hypertuned Kappa-metric on Testing Data: {:.4f}".format(gb_hypertuned_kappa_metric))
print (gb_best_clf)
print("--- %s seconds in execution---" % (time.time() - start_time))
Un - Optimized Model
------
Accuracy Score on Testing Data: 0.7934
F-score on Testing Data: 0.6086
Recall on Testing Data: 0.5045
Precision on Testing Data: 0.6417
ROC-AUC on Testing Data: 0.7012
Kappa-metric on Testing Data: 0.4319
Optimized Model
------
Hypertuned Accuracy Score on the Testing Data: 0.9303
Hypertuned F-score on the Testing Data: 0.8645
Hypertuned Recall on Testing Data: 0.8806
Hypertuned Precision on Testing Data: 0.8606
Hypertuned ROC-AUC on Testing Data: 0.9145
Hypertuned Kappa-metric on Testing Data: 0.8228
GradientBoostingClassifier(learning_rate=1, max_depth=9, n_estimators=500,
random_state=30)
--- 907.4854693412781 seconds in execution---
# Plot model evaluations. Pass gb_best_predictions; the bare `predictions` variable is left over from an earlier model.
confusion_matrix_plot(os_smote_X, os_smote_Y, test_X, test_Y, gb_grid_obj, gb_best_predictions, 'Gradient Boost (Tuned)')
roc_curve_auc_score(test_X, test_Y, gb_best_predictions_tuned_prob, 'Gradient Boost (Tuned)')
precision_recall_curve_and_scores(test_X, test_Y, gb_best_predictions, gb_best_predictions_tuned_prob, 'Gradient Boost (Tuned)')
Accuracy Score Test: 0.7488151658767772 Accuracy Score Train: 0.9989669421487604 (as comparison)
AUC Score (ROC): 0.9691095054137624
F1 Score: 0.6225071225071225 AUC Score (PR): 0.9130362305935499
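The tuned Gradient Boosting model moves test accuracy from 0.7934 to 0.9303 with a training accuracy of about 0.999. A jump of this size is worth checking for leakage. This section does not show how `os_smote_X` was built, so whether it shares rows with `test_X` is an open question rather than an established fact. A minimal sketch of such a check, assuming rows can be turned into tuples (`count_shared_rows` is an illustrative name and the rows are hypothetical):

```python
def count_shared_rows(train_rows, test_rows):
    """Count test rows that also appear, value for value, among the training rows."""
    seen = {tuple(row) for row in train_rows}
    return sum(1 for row in test_rows if tuple(row) in seen)


# Hypothetical toy rows standing in for the feature matrices.
train_rows = [(1, 0, 29.85), (0, 1, 56.95), (1, 1, 53.85)]
test_rows = [(0, 1, 56.95), (0, 0, 42.30)]

shared = count_shared_rows(train_rows, test_rows)
print(f"{shared} of {len(test_rows)} test rows also occur in the training data")
```

If both matrices are pandas DataFrames with the same column order (an assumption), the rows could be supplied as `os_smote_X.itertuples(index=False)` and `test_X.itertuples(index=False)`. A count well above zero would suggest the oversampled set and the test set come from different splits.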
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer
from sklearn.metrics import fbeta_score, accuracy_score
from sklearn.linear_model import LogisticRegression # Logistic Regression Classifier
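from sklearn.model_selection import RandomizedSearchCV
import time
# Restart the timer so the elapsed time printed below covers only this cell
start_time = time.time()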
xgb = XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bytree=1, gamma=0, learning_rate=0.9, max_delta_step=0,
max_depth = 7, min_child_weight=1, missing=1, n_estimators=1000,
n_jobs=1, nthread=6, objective='binary:logistic', random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=27,
silent=True, subsample=1,verbosity=0)
parameters = {
'max_depth': range (2, 10, 1),
'n_estimators': range(60, 220, 40),
#'learning_rate': [0.1, 0.01, 0.05]
'learning_rate': [1, 1, 5]
}
# Perform grid search on the classifier using 'scorer' as the scoring method
xgb_grid_obj = GridSearchCV(
estimator=xgb,
param_grid=parameters,
scoring = 'roc_auc',
n_jobs = 10,
cv = 5,
verbose=True
)
# xgb_grid_obj=RandomizedSearchCV(estimator = xgb,
# param_grid=parameters, cv = 5, verbose = False,scoring='roc_auc',n_iter=1000)
# Fit the grid search object to the training data and find the optimal parameters
xgb_grid_fit = xgb_grid_obj.fit(os_smote_X,os_smote_Y,verbose = False)
# Get the estimator
xgb_best_clf = xgb_grid_fit.best_estimator_
# View best hyperparameters
#print(xgb_best_clf.best_params_)
# Make predictions using the unoptimized and the tuned model
xgb_predictions = (xgb.fit(train_X,train_Y)).predict(test_X)
xgb_best_predictions = xgb_best_clf.predict(test_X)
xgb_best_predictions_tuned_prob = xgb_grid_obj.predict_proba(test_X)
#before hypertuning
xgb_before_accuracy=accuracy_score(test_Y, xgb_predictions)
xgb_before_f1_score=fbeta_score(test_Y, xgb_predictions, beta = 0.5)
xgb_before_recal = recall_score(test_Y,xgb_predictions)
xgb_before_precision = precision_score(test_Y,xgb_predictions)
xgb_before_roc_auc = roc_auc_score(test_Y,xgb_predictions)
xgb_before_kappa_metric = cohen_kappa_score(test_Y,xgb_predictions)
#after hypertuning
xgb_hypertuned_accuracy=accuracy_score(test_Y, xgb_best_predictions)
xgb_hypertuned_f1_score=fbeta_score(test_Y, xgb_best_predictions, beta = 0.5)
xgb_hypertuned_recal = recall_score(test_Y,xgb_best_predictions)
xgb_hypertuned_precision = precision_score(test_Y,xgb_best_predictions)
xgb_hypertuned_roc_auc = roc_auc_score(test_Y,xgb_best_predictions)
xgb_hypertuned_kappa_metric = cohen_kappa_score(test_Y,xgb_best_predictions)
# Report the before-and-after scores
print ("Un - Optimized Model\n------")
print ("Accuracy Score on Testing Data: {:.4f}".format(xgb_before_accuracy))
print ("F-score on Testing Data: {:.4f}".format(xgb_before_f1_score))
print ("Recall on Testing Data: {:.4f}".format(xgb_before_recal))
print ("Precision on Testing Data: {:.4f}".format(xgb_before_precision))
print ("ROC-AUC on Testing Data: {:.4f}".format(xgb_before_roc_auc))
print ("Kappa-metric on Testing Data: {:.4f}".format(xgb_before_kappa_metric))
print ("\nOptimized Model\n------")
print ("Hypertuned Accuracy Score on the Testing Data: {:.4f}".format(xgb_hypertuned_accuracy))
print ("Hypertuned F-score on the Testing Data: {:.4f}".format(xgb_hypertuned_f1_score))
print ("Hypertuned Recall on Testing Data: {:.4f}".format(xgb_hypertuned_recal))
print ("Hypertuned Precision on Testing Data: {:.4f}".format(xgb_hypertuned_precision))
print ("Hypertuned ROC-AUC on Testing Data: {:.4f}".format(xgb_hypertuned_roc_auc))
print ("Hypertuned Kappa-metric on Testing Data: {:.4f}".format(xgb_hypertuned_kappa_metric))
print (xgb_best_clf)
print("--- %s seconds in execution---" % (time.time() - start_time))
Fitting 5 folds for each of 96 candidates, totalling 480 fits
Fitting 5 folds for each of 96 candidates, totalling 480 fits
Un - Optimized Model
------
Accuracy Score on Testing Data: 0.7635
F-score on Testing Data: 0.5514
Recall on Testing Data: 0.5223
Precision on Testing Data: 0.5592
ROC-AUC on Testing Data: 0.6866
Kappa-metric on Testing Data: 0.3812
Optimized Model
------
Hypertuned Accuracy Score on the Testing Data: 0.9294
Hypertuned F-score on the Testing Data: 0.8608
Hypertuned Recall on Testing Data: 0.8841
Hypertuned Precision on Testing Data: 0.8552
Hypertuned ROC-AUC on Testing Data: 0.9150
Hypertuned Kappa-metric on Testing Data: 0.8210
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=1, max_delta_step=0, max_depth=7,
min_child_weight=1, missing=1, monotone_constraints='()',
n_estimators=180, n_jobs=1, nthread=6, num_parallel_tree=1,
random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
seed=27, silent=True, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=0)
--- 1775.4611022472382 seconds in execution---
# Plot model evaluations. Pass xgb_best_predictions; the bare `predictions` variable is left over from an earlier model.
confusion_matrix_plot(os_smote_X, os_smote_Y, test_X, test_Y, xgb_grid_obj, xgb_best_predictions, 'XGBoost (Tuned)')
roc_curve_auc_score(test_X, test_Y, xgb_best_predictions_tuned_prob, 'XGBoost (Tuned)')
precision_recall_curve_and_scores(test_X, test_Y, xgb_best_predictions, xgb_best_predictions_tuned_prob, 'XGBoost (Tuned)')
Accuracy Score Test: 0.7488151658767772 Accuracy Score Train: 0.9999968650749607 (as comparison)
AUC Score (ROC): 0.9513975435822548
F1 Score: 0.6225071225071225 AUC Score (PR): 0.8885773813443274
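The log line "Fitting 5 folds for each of 96 candidates, totalling 480 fits" follows directly from the size of the XGBoost grid. The sketch below recomputes that count from the grid as written in the cell and also counts the distinct combinations, since `learning_rate` lists the value 1 twice. The variable names `n_candidates`, `n_fits`, and `n_distinct` are ours.

```python
from math import prod

# Search space copied from the XGBoost cell above.
parameters = {
    'max_depth': range(2, 10, 1),
    'n_estimators': range(60, 220, 40),
    'learning_rate': [1, 1, 5],
}
cv_folds = 5

# An exhaustive grid search tries every combination, duplicates included.
n_candidates = prod(len(values) for values in parameters.values())
n_fits = n_candidates * cv_folds

# Combinations that remain once repeated values are dropped.
n_distinct = prod(len(set(values)) for values in parameters.values())

print(n_candidates, n_fits, n_distinct)
```

Under these settings a third of the fits repeat an already-evaluated combination, so removing the duplicate value would shorten the search without changing which combinations are explored.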
from catboost import CatBoostClassifier
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
import time
start_time = time.time()
parameters = {'depth' : [4,5,6,7,8,9,10],
'learning_rate' : [0.01,0.02,0.03,0.04],
'iterations' : [10,20,30,40,50,60,70,80,90,100]
}
catboost_clf = CatBoostClassifier(cat_features=categorical_columns,
l2_leaf_reg=120, depth=16,
auto_class_weights='Balanced',
iterations=200, learning_rate=0.16,
use_best_model=True,
verbose=False,
one_hot_max_size=31,
early_stopping_rounds=150,
eval_metric='F1', random_state=0)
# cat_grid_obj = GridSearchCV(estimator=catboost_clf, param_grid = parameters, cv = 2, n_jobs=-1)
cat_grid_obj = GridSearchCV(estimator=CatBoostClassifier(), param_grid = parameters, cv = 2, n_jobs=-1)
cat_grid_fit = cat_grid_obj.fit(os_smote_X,os_smote_Y,verbose = False)
# Get the estimator
cat_best_clf = cat_grid_fit.best_estimator_
#cat_predictions = (catboost_clf.fit(os_smote_X, os_smote_Y)).predict(test_X)
cat_predictions = catboost_clf.fit(train_X,train_Y,eval_set=(os_smote_X,os_smote_Y),verbose=False).predict(test_X)
cat_best_predictions_tuned_prob = cat_grid_obj.predict_proba(test_X)
cat_best_predictions = cat_best_clf.predict(test_X)
#before hypertuning
cat_before_accuracy=accuracy_score(test_Y, cat_predictions)
cat_before_f1_score=fbeta_score(test_Y, cat_predictions, beta = 0.5)
cat_before_recal = recall_score(test_Y,cat_predictions)
cat_before_precision = precision_score(test_Y,cat_predictions)
cat_before_roc_auc = roc_auc_score(test_Y,cat_predictions)
cat_before_kappa_metric = cohen_kappa_score(test_Y,cat_predictions)
#after hypertuning
cat_hypertuned_accuracy=accuracy_score(test_Y, cat_best_predictions)
cat_hypertuned_f1_score=fbeta_score(test_Y, cat_best_predictions, beta = 0.5)
cat_hypertuned_recal = recall_score(test_Y,cat_best_predictions)
cat_hypertuned_precision = precision_score(test_Y,cat_best_predictions)
cat_hypertuned_roc_auc = roc_auc_score(test_Y,cat_best_predictions)
cat_hypertuned_kappa_metric = cohen_kappa_score(test_Y,cat_best_predictions)
# Report the before-and-after scores
print ("Un - Optimized Model\n------")
print ("Accuracy Score on Testing Data: {:.4f}".format(cat_before_accuracy))
print ("F-score on Testing Data: {:.4f}".format(cat_before_f1_score))
print ("Recall on Testing Data: {:.4f}".format(cat_before_recal))
print ("Precision on Testing Data: {:.4f}".format(cat_before_precision))
print ("ROC-AUC on Testing Data: {:.4f}".format(cat_before_roc_auc))
print ("Kappa-metric on Testing Data: {:.4f}".format(cat_before_kappa_metric))
print ("\nOptimized Model\n------")
print ("Hypertuned Accuracy Score on the Testing Data: {:.4f}".format(cat_hypertuned_accuracy))
print ("Hypertuned F-score on the Testing Data: {:.4f}".format(cat_hypertuned_f1_score))
print ("Hypertuned Recall on Testing Data: {:.4f}".format(cat_hypertuned_recal))
print ("Hypertuned Precision on Testing Data: {:.4f}".format(cat_hypertuned_precision))
print ("Hypertuned ROC-AUC on Testing Data: {:.4f}".format(cat_hypertuned_roc_auc))
print ("Hypertuned Kappa-metric on Testing Data: {:.4f}".format(cat_hypertuned_kappa_metric))
print (cat_best_clf)
print("--- %s seconds in execution---" % (time.time() - start_time))
Un - Optimized Model
------
Accuracy Score on Testing Data: 0.7441
F-score on Testing Data: 0.5478
Recall on Testing Data: 0.7540
Precision on Testing Data: 0.5127
ROC-AUC on Testing Data: 0.7472
Kappa-metric on Testing Data: 0.4300
Optimized Model
------
Hypertuned Accuracy Score on the Testing Data: 0.8242
Hypertuned F-score on the Testing Data: 0.6602
Hypertuned Recall on Testing Data: 0.8324
Hypertuned Precision on Testing Data: 0.6277
Hypertuned ROC-AUC on Testing Data: 0.8268
Hypertuned Kappa-metric on Testing Data: 0.5920
<catboost.core.CatBoostClassifier object at 0x1362ab3a0>
--- 865.0071070194244 seconds in execution---
# Plot model evaluations, passing the tuned predictions so the plots match the '(Tuned)' title.
confusion_matrix_plot(os_smote_X, os_smote_Y, test_X, test_Y, cat_grid_obj, cat_best_predictions, 'CatBoost (Tuned)')
roc_curve_auc_score(test_X, test_Y, cat_best_predictions_tuned_prob, 'CatBoost (Tuned)')
precision_recall_curve_and_scores(test_X, test_Y, cat_best_predictions,
                                  cat_best_predictions_tuned_prob, 'CatBoost (Tuned)')
Accuracy Score Test: 0.7440758293838863 Accuracy Score Train: 0.8925619834710744 (as comparison)
AUC Score (ROC): 0.8975878866130642
F1 Score: 0.6103896103896105 AUC Score (PR): 0.7356624514538143
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import make_scorer
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
import time
start_time = time.time()
bgc = BaggingClassifier(random_state=123)
scorer = make_scorer(fbeta_score,beta=0.5)
param_grid = {
'base_estimator__max_depth' : [1, 2, 3, 4, 5],
'max_samples' : [0.05, 0.1, 0.2, 0.5]
}
bag_grid_obj = GridSearchCV(BaggingClassifier(DecisionTreeClassifier(),
n_estimators = 100, max_features = 0.5),
param_grid, scoring = scorer)
# Fit the grid search object to the training data and find the optimal parameters
bag_grid_fit = bag_grid_obj.fit(os_smote_X,os_smote_Y)
# Get the estimator
bag_best_clf = bag_grid_fit.best_estimator_
# Make predictions using the unoptimized and the tuned model
bag_predictions = (bgc.fit(train_X, train_Y)).predict(test_X)
bag_best_predictions = bag_best_clf.predict(test_X)
bag_best_predictions_tuned_prob = bag_grid_obj.predict_proba(test_X)
#before hypertuning
bag_before_accuracy=accuracy_score(test_Y, bag_predictions)
bag_before_f1_score=fbeta_score(test_Y, bag_predictions, beta = 0.5)
bag_before_recal = recall_score(test_Y,bag_predictions)
bag_before_precision = precision_score(test_Y,bag_predictions)
bag_before_roc_auc = roc_auc_score(test_Y,bag_predictions)
bag_before_kappa_metric = cohen_kappa_score(test_Y,bag_predictions)
#after hypertuning
bag_hypertuned_accuracy=accuracy_score(test_Y, bag_best_predictions)
bag_hypertuned_f1_score=fbeta_score(test_Y, bag_best_predictions, beta = 0.5)
bag_hypertuned_recal = recall_score(test_Y,bag_best_predictions)
bag_hypertuned_precision = precision_score(test_Y,bag_best_predictions)
bag_hypertuned_roc_auc = roc_auc_score(test_Y,bag_best_predictions)
bag_hypertuned_kappa_metric = cohen_kappa_score(test_Y,bag_best_predictions)
# Report the before-and-after scores
print ("Un - Optimized Model\n------")
print ("Accuracy Score on Testing Data: {:.4f}".format(bag_before_accuracy))
print ("F-score on Testing Data: {:.4f}".format(bag_before_f1_score))
print ("Recall on Testing Data: {:.4f}".format(bag_before_recal))
print ("Precision on Testing Data: {:.4f}".format(bag_before_precision))
print ("ROC-AUC on Testing Data: {:.4f}".format(bag_before_roc_auc))
print ("Kappa-metric on Testing Data: {:.4f}".format(bag_before_kappa_metric))
print ("\nOptimized Model\n------")
print ("Hypertuned Accuracy Score on the Testing Data: {:.4f}".format(bag_hypertuned_accuracy))
print ("Hypertuned F-score on the Testing Data: {:.4f}".format(bag_hypertuned_f1_score))
print ("Hypertuned Recall on Testing Data: {:.4f}".format(bag_hypertuned_recal))
print ("Hypertuned Precision on Testing Data: {:.4f}".format(bag_hypertuned_precision))
print ("Hypertuned ROC-AUC on Testing Data: {:.4f}".format(bag_hypertuned_roc_auc))
print ("Hypertuned Kappa-metric on Testing Data: {:.4f}".format(bag_hypertuned_kappa_metric))
print (bag_best_clf)
print("--- %s seconds in execution---" % (time.time() - start_time))
Un - Optimized Model
------
Accuracy Score on Testing Data: 0.7687
F-score on Testing Data: 0.5482
Recall on Testing Data: 0.4278
Precision on Testing Data: 0.5897
ROC-AUC on Testing Data: 0.6600
Kappa-metric on Testing Data: 0.3507
Optimized Model
------
Hypertuned Accuracy Score on the Testing Data: 0.7573
Hypertuned F-score on the Testing Data: 0.5671
Hypertuned Recall on Testing Data: 0.7968
Hypertuned Precision on Testing Data: 0.5290
Hypertuned ROC-AUC on Testing Data: 0.7699
Hypertuned Kappa-metric on Testing Data: 0.4648
BaggingClassifier(base_estimator=DecisionTreeClassifier(max_depth=5),
max_features=0.5, max_samples=0.5, n_estimators=100)
--- 33.06117105484009 seconds in execution---
# Plot model evaluations, passing the tuned predictions so the plots match the '(Tuned)' title.
confusion_matrix_plot(os_smote_X, os_smote_Y, test_X, test_Y, bag_grid_obj, bag_best_predictions, 'Bagging (Tuned)')
roc_curve_auc_score(test_X, test_Y, bag_best_predictions_tuned_prob, 'Bagging (Tuned)')
precision_recall_curve_and_scores(test_X, test_Y, bag_best_predictions, bag_best_predictions_tuned_prob, 'Bagging (Tuned)')
Accuracy Score Test: 0.7687203791469195 Accuracy Score Train: 0.7847208808190071 (as comparison)
AUC Score (ROC): 0.8487207548081737
F1 Score: 0.49586776859504134 AUC Score (PR): 0.6473556046518667
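Each report prints a kappa metric next to accuracy. Cohen's kappa rescales the observed agreement between predictions and labels by the agreement expected from chance alone, which matters on an imbalanced target such as churn. A minimal sketch of the formula for the binary case, using hypothetical confusion-matrix counts (`cohen_kappa` is our own helper, not the scikit-learn function):

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 confusion matrix."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n  # plain accuracy
    # Chance agreement: product of actual and predicted class shares, summed over both classes.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical counts: 50 actual churners, 150 actual non-churners.
kappa = cohen_kappa(tp=40, fp=20, fn=10, tn=130)
accuracy = (40 + 130) / 200
print(f"accuracy = {accuracy:.3f}, kappa = {kappa:.3f}")
```

In this toy case accuracy is 0.85 while kappa is 0.625, because a chance-level classifier with the same class shares would already agree with the labels 60% of the time.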
from lightgbm import LGBMClassifier
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import fbeta_score, accuracy_score
import numpy as np
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
import time
start_time = time.time()
n_HP_points_to_test = 100
param_test ={'num_leaves': sp_randint(6, 50),
'min_child_samples': sp_randint(100, 500),
'min_child_weight': [1e-5, 1e-3, 1e-2, 1e-1, 1, 1e1, 1e2, 1e3, 1e4],
'subsample': sp_uniform(loc=0.2, scale=0.8),
'colsample_bytree': sp_uniform(loc=0.4, scale=0.6),
'reg_alpha': [0, 1e-1, 1, 2, 5, 7, 10, 50, 100],
'reg_lambda': [0, 1e-1, 1, 5, 10, 20, 50, 100]}
lgbm_c = LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
learning_rate=0.5, max_depth=7, min_child_samples=20,
min_child_weight=0.001, min_split_gain=0.0, n_estimators=100,
n_jobs=-1, num_leaves=500, objective='binary', random_state=None,
reg_alpha=0.0, reg_lambda=0.0, silent=True, subsample=1.0,
subsample_for_bin=200000, subsample_freq=0)
lgb_grid_obj = RandomizedSearchCV(
estimator=lgbm_c, param_distributions=param_test,
n_iter=n_HP_points_to_test,
scoring='roc_auc',
cv=3,
refit=True,
random_state=314,
verbose=True)
#telecom_churn_prediction(lgbm_c,os_smote_X,test_X,os_smote_Y,test_Y,
#cols,"features",threshold_plot = True)
# TODO: Fit the grid search object to the training data and find the optimal parameters
lgb_grid_fit = lgb_grid_obj.fit(os_smote_X,os_smote_Y)
# Get the estimator
lgb_best_clf = lgb_grid_fit.best_estimator_
# Make predictions using the unoptimized and the tuned model
lgb_predictions = (lgbm_c.fit(train_X, train_Y)).predict(test_X)
lgb_best_predictions_tuned_prob = lgb_grid_obj.predict_proba(test_X)
lgb_best_predictions = lgb_best_clf.predict(test_X)
#before hypertuning
lgb_before_accuracy=accuracy_score(test_Y, lgb_predictions)
lgb_before_f1_score=fbeta_score(test_Y, lgb_predictions, beta = 0.5)
lgb_before_recal = recall_score(test_Y,lgb_predictions)
lgb_before_precision = precision_score(test_Y,lgb_predictions)
lgb_before_roc_auc = roc_auc_score(test_Y,lgb_predictions)
lgb_before_kappa_metric = cohen_kappa_score(test_Y,lgb_predictions)
#after hypertuning
lgb_hypertuned_accuracy=accuracy_score(test_Y, lgb_best_predictions)
lgb_hypertuned_f1_score=fbeta_score(test_Y, lgb_best_predictions, beta = 0.5)
lgb_hypertuned_recal = recall_score(test_Y,lgb_best_predictions)
lgb_hypertuned_precision = precision_score(test_Y,lgb_best_predictions)
lgb_hypertuned_roc_auc = roc_auc_score(test_Y,lgb_best_predictions)
lgb_hypertuned_kappa_metric = cohen_kappa_score(test_Y,lgb_best_predictions)
# Report the before-and-after scores
print ("Un - Optimized Model\n------")
print ("Accuracy Score on Testing Data: {:.4f}".format(lgb_before_accuracy))
print ("F-score on Testing Data: {:.4f}".format(lgb_before_f1_score))
print ("Recall on Testing Data: {:.4f}".format(lgb_before_recal))
print ("Precision on Testing Data: {:.4f}".format(lgb_before_precision))
print ("ROC-AUC on Testing Data: {:.4f}".format(lgb_before_roc_auc))
print ("Kappa-metric on Testing Data: {:.4f}".format(lgb_before_kappa_metric))
print ("\nOptimized Model\n------")
print ("Hypertuned Accuracy Score on the Testing Data: {:.4f}".format(lgb_hypertuned_accuracy))
print ("Hypertuned F-score on the Testing Data: {:.4f}".format(lgb_hypertuned_f1_score))
print ("Hypertuned Recall on Testing Data: {:.4f}".format(lgb_hypertuned_recal))
print ("Hypertuned Precision on Testing Data: {:.4f}".format(lgb_hypertuned_precision))
print ("Hypertuned ROC-AUC on Testing Data: {:.4f}".format(lgb_hypertuned_roc_auc))
print ("Hypertuned Kappa-metric on Testing Data: {:.4f}".format(lgb_hypertuned_kappa_metric))
print (lgb_best_clf)
print("--- %s seconds in execution---" % (time.time() - start_time))
Fitting 3 folds for each of 100 candidates, totalling 300 fits
[LightGBM] [Warning] Accuracy may be bad since you didn't explicitly set num_leaves OR 2^max_depth > num_leaves. (num_leaves=31).
Un - Optimized Model
------
Accuracy Score on Testing Data: 0.7630
F-score on Testing Data: 0.5485
Recall on Testing Data: 0.5062
Precision on Testing Data: 0.5602
ROC-AUC on Testing Data: 0.6811
Kappa-metric on Testing Data: 0.3737
Optimized Model
------
Hypertuned Accuracy Score on the Testing Data: 0.8773
Hypertuned F-score on the Testing Data: 0.7580
Hypertuned Recall on Testing Data: 0.8217
Hypertuned Precision on Testing Data: 0.7435
Hypertuned ROC-AUC on Testing Data: 0.8596
Hypertuned Kappa-metric on Testing Data: 0.6958
LGBMClassifier(colsample_bytree=0.952164731370897, learning_rate=0.5,
max_depth=7, min_child_samples=111, min_child_weight=0.01,
num_leaves=38, objective='binary', reg_alpha=0, reg_lambda=0.1,
subsample=0.3029313662262354)
--- 14.842597007751465 seconds in execution---
# Plot model evaluations, passing the tuned predictions and the test features.
confusion_matrix_plot(os_smote_X, os_smote_Y, test_X, test_Y, lgb_grid_obj, lgb_best_predictions, 'LightGBM (Tuned)')
roc_curve_auc_score(test_X, test_Y, lgb_best_predictions_tuned_prob, 'LightGBM (Tuned)')
precision_recall_curve_and_scores(test_X, test_Y, lgb_best_predictions, lgb_best_predictions_tuned_prob, 'LightGBM (Tuned)')
Accuracy Score Test: 0.7630331753554502 Accuracy Score Train: 0.9888778529151869 (as comparison)
AUC Score (ROC): 0.9249254018175144
F1 Score: 0.6225071225071225 AUC Score (PR): 0.8079651386195097
from sklearn.metrics import f1_score
from sklearn.metrics import cohen_kappa_score
# Build a one-row DataFrame comparing a model's metrics before and after hyperparameter tuning.
def model_report_hypertuned(accuracy_before, accuracy_hypertuned,
                            f1score_before, f1score_hypertuned,
                            recall_before, recall_hypertuned,
                            precision_before, precision_hypertuned,
                            roc_auc_before, roc_auc_hypertuned,
                            kappa_metric_before, kappa_metric_hypertuned, name):
    df = pd.DataFrame({"Model Name" : [name],
                       "Accuracy UnOptimized" : [accuracy_before],
                       "Accuracy Hypertuned" : [accuracy_hypertuned],
                       "F1 Score UnOptimized" : [f1score_before],
                       "F1 Score Hypertuned" : [f1score_hypertuned],
                       "Recall UnOptimized" : [recall_before],
                       "Recall Hypertuned" : [recall_hypertuned],
                       "Precision UnOptimized" : [precision_before],
                       "Precision Hypertuned" : [precision_hypertuned],
                       "ROC_AUC Unoptimized" : [roc_auc_before],
                       "ROC_AUC Hypertuned" : [roc_auc_hypertuned],
                       "kappa Unoptimized" : [kappa_metric_before],
                       "Kappa Hypertuned" : [kappa_metric_hypertuned]
                       })
    return df
import time
start_time = time.time()
#outputs for every model
model1_hypertuned = model_report_hypertuned(logit_before_accuracy,logit_hypertuned_accuracy,
logit_before_f1_score,logit_hypertuned_f1_score,
logit_before_recal,logit_hypertuned_recal,
logit_before_precision,logit_hypertuned_precision,
logit_before_roc_auc,logit_hypertuned_roc_auc,
logit_before_kappa_metric,logit_hypertuned_kappa_metric,
"Logistic R(BM)")
model2_hypertuned = model_report_hypertuned(rf_before_accuracy,rf_hypertuned_accuracy,
rf_before_f1_score,rf_hypertuned_f1_score,
rf_before_recal,rf_hypertuned_recal,
rf_before_precision,rf_hypertuned_precision,
rf_before_roc_auc,rf_hypertuned_roc_auc,
rf_before_kappa_metric,rf_hypertuned_kappa_metric,
"Random Forest")
model3_hypertuned = model_report_hypertuned(xgb_before_accuracy,xgb_hypertuned_accuracy,
xgb_before_f1_score,xgb_hypertuned_f1_score,
xgb_before_recal,xgb_hypertuned_recal,
xgb_before_precision,xgb_hypertuned_precision,
xgb_before_roc_auc,xgb_hypertuned_roc_auc,
xgb_before_kappa_metric,xgb_hypertuned_kappa_metric,
"XGBoost")
model4_hypertuned = model_report_hypertuned(gb_before_accuracy,gb_hypertuned_accuracy,
gb_before_f1_score,gb_hypertuned_f1_score,
gb_before_recal,gb_hypertuned_recal,
gb_before_precision,gb_hypertuned_precision,
gb_before_roc_auc,gb_hypertuned_roc_auc,
gb_before_kappa_metric,gb_hypertuned_kappa_metric,
"Grad Boost")
model5_hypertuned = model_report_hypertuned(ada_before_accuracy,ada_hypertuned_accuracy,
ada_before_f1_score,ada_hypertuned_f1_score,
ada_before_recal,ada_hypertuned_recal,
ada_before_precision,ada_hypertuned_precision,
ada_before_roc_auc,ada_hypertuned_roc_auc,
ada_before_kappa_metric,ada_hypertuned_kappa_metric,
"AdaBoost")
model6_hypertuned = model_report_hypertuned(lgb_before_accuracy,lgb_hypertuned_accuracy,
lgb_before_f1_score,lgb_hypertuned_f1_score,
lgb_before_recal,lgb_hypertuned_recal,
lgb_before_precision,lgb_hypertuned_precision,
lgb_before_roc_auc,lgb_hypertuned_roc_auc,
lgb_before_kappa_metric,lgb_hypertuned_kappa_metric,
"LightGBM")
model7_hypertuned = model_report_hypertuned(bag_before_accuracy,bag_hypertuned_accuracy,
bag_before_f1_score,bag_hypertuned_f1_score,
bag_before_recal,bag_hypertuned_recal,
bag_before_precision,bag_hypertuned_precision,
bag_before_roc_auc,bag_hypertuned_roc_auc,
bag_before_kappa_metric,bag_hypertuned_kappa_metric,
"Bagging")
# CatBoost row: use the CatBoost tuned metrics (not the LightGBM ones).
model8_hypertuned = model_report_hypertuned(cat_before_accuracy,cat_hypertuned_accuracy,
cat_before_f1_score,cat_hypertuned_f1_score,
cat_before_recal,cat_hypertuned_recal,
cat_before_precision,cat_hypertuned_precision,
cat_before_roc_auc,cat_hypertuned_roc_auc,
cat_before_kappa_metric,cat_hypertuned_kappa_metric,
"CatBoost")
#concat all models +Bagging + LGBM + CatBoost
model_performances_hypertuned = pd.concat([model1_hypertuned,model2_hypertuned,model3_hypertuned,
model4_hypertuned,model5_hypertuned,model6_hypertuned,model7_hypertuned,model8_hypertuned],
axis = 0).reset_index(drop = True)
model_performances_hypertuned.sort_values(by='Accuracy Hypertuned',ascending=False, inplace=True)
table = ff.create_table(np.round(model_performances_hypertuned,4))
table.update_layout(
autosize=True,
width=2100,
height=200)
py.iplot(table)
print("--- %s seconds in execution ---" % (time.time() - start_time))
--- 0.14574909210205078 seconds in execution ---
model_performances
# Horizontal bar trace of one tuned metric across all models.
def output_tracer_hypertuned(metric,color) :
    tracer = go.Bar(y = model_performances_hypertuned["Model Name"] ,
                    x = model_performances_hypertuned[metric],
                    orientation = "h",name = metric ,
                    marker = dict(line = dict(width =.7),
                                  color = color)
                    )
    return tracer
layout = go.Layout(dict(title = "Model Performances",
plot_bgcolor = "rgb(243,243,243)",
paper_bgcolor = "rgb(200,243,243)",
xaxis = dict(gridcolor = 'rgb(255, 255, 255)',
title = "metric",
zerolinewidth=1,
ticklen=5,gridwidth=1),
yaxis = dict(gridcolor = 'rgb(255, 255, 255)',
zerolinewidth=1,ticklen=5,gridwidth=1),
# margin = dict(l = 250),
margin=dict(
l=50,
r=50,
b=50,
t=50,
pad=1
),
height = 780,
width=1200
)
)
trace1 = output_tracer_hypertuned('Accuracy Hypertuned',"#33CC99")
trace2 = output_tracer_hypertuned('F1 Score Hypertuned',"#f542e3")
trace3 = output_tracer_hypertuned('Recall Hypertuned',"#FFCC99")
trace4 = output_tracer_hypertuned('Precision Hypertuned',"red")
trace5 = output_tracer_hypertuned('ROC_AUC Hypertuned',"lightgrey")
trace6 = output_tracer_hypertuned('Kappa Hypertuned',"purple")
data = [trace1,trace2,trace3,trace4,trace5,trace6]
fig = go.Figure(data=data,layout=layout)
# fig.update_layout(
# autosize=True,
# width=1600,
# height=500)
py.iplot(fig)
# Confusion matrix for every tuned model; lst and mods are in the same order.
lst = [xgb_grid_obj,gb_grid_obj,lgb_grid_obj,cat_grid_obj,ada_grid_obj,logit_best_clf,rf_best_clf,bag_grid_obj]
length = len(lst)
mods = ['XGBoost Classifier','Gradient Boosting Classifier','LGBM Classifier',
        'Cat Boost','AdaBoost Classifier','Logistic Regression(Baseline_model)',
        'Random Forest Classifier','Bagging Classifier']
fig = plt.figure(figsize=(18,15))
fig.set_facecolor("#F3F3F3")
for i,j,k in itertools.zip_longest(lst,range(length),mods) :
    plt.subplot(4,5,j+1)
    predictions = i.predict(test_X)
    # sklearn expects (y_true, y_pred): rows are actual classes, columns are predicted classes
    conf_matrix = confusion_matrix(test_Y,predictions)
    sns.heatmap(conf_matrix,annot=True,fmt = "d",square = True,
                xticklabels=["not churn","churn"],
                yticklabels=["not churn","churn"],
                linewidths = 2,linecolor = "w",cmap = "Set1")
    plt.xlabel("predicted")
    plt.ylabel("actual")
    plt.title(k,color = "b")
plt.subplots_adjust(wspace = .3,hspace = .3)
# ROC curve for every tuned model; lst and mods are in the same order.
lst = [xgb_grid_obj,gb_grid_obj,lgb_grid_obj,cat_grid_obj,ada_grid_obj,logit_best_clf,rf_best_clf,bag_grid_obj]
length = len(lst)
mods = ['XGBoost Classifier','Gradient Boosting Classifier',
        'LGBM Classifier','CatBoost Algorithm','AdaBoost Classifier',
        'Logistic Regression(Baseline_model)','Random Forest Classifier','Bagging Classifier']
plt.style.use("dark_background")
fig = plt.figure(figsize=(15,16))
fig.set_facecolor("#F3F3F3")
for i,j,k in itertools.zip_longest(lst,range(length),mods) :
    qx = plt.subplot(4,5,j+1)
    probabilities = i.predict_proba(test_X)
    fpr,tpr,thresholds = roc_curve(test_Y,probabilities[:,1])
    # AUC is computed from the churn probabilities so it matches the plotted curve
    plt.plot(fpr,tpr,linestyle = "dotted",
             color = "royalblue",linewidth = 2,
             label = "AUC = " + str(np.around(roc_auc_score(test_Y,probabilities[:,1]),3)))
    plt.plot([0,1],[0,1],linestyle = "dashed",
             color = "orangered",linewidth = 1.5)
    plt.fill_between(fpr,tpr,alpha = .4)
    plt.fill_between([0,1],[0,1],color = "k")
    plt.legend(loc = "lower right",
               prop = {"size" : 12})
    qx.set_facecolor("k")
    plt.grid(True,alpha = .15)
    plt.title(k,color = "b")
    plt.xticks(np.arange(0,1,.3))
    plt.yticks(np.arange(0,1,.3))
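ROC-AUC is a ranking metric: it measures how often a churner receives a higher score than a non-churner. Computing it from hard 0/1 predictions instead of predicted probabilities collapses the curve to a single threshold and usually understates the score. The sketch below illustrates the gap with a small pure-Python AUC and made-up scores; `roc_auc` is an illustrative helper (the notebook itself uses scikit-learn's `roc_auc_score`), and the data are hypothetical.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive outranks
    a random negative, counting ties as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted churn probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
proba = [0.10, 0.20, 0.30, 0.60, 0.40, 0.55, 0.70, 0.90]
labels = [1 if p >= 0.5 else 0 for p in proba]  # hard predictions at 0.5

auc_from_proba = roc_auc(y_true, proba)
auc_from_labels = roc_auc(y_true, labels)
print(f"AUC from probabilities: {auc_from_proba:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```

On this toy data the probability-based AUC is higher than the label-based one, which is why the second argument to an AUC function should normally be the positive-class probability column.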
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
# Precision-recall curve for every tuned model; one title per model, in the same order as lst.
lst = [xgb_grid_obj,gb_grid_obj,lgb_grid_obj,cat_grid_obj,ada_grid_obj,logit_best_clf,rf_best_clf,bag_grid_obj]
length = len(lst)
mods = ['XGBoost Classifier','Gradient Boosting Classifier',
        'LGBM Classifier','CatBoost Algorithm','AdaBoost Classifier',
        'Logistic Regression(Baseline_model)','Random Forest Classifier','Bagging Classifier']
fig = plt.figure(figsize=(15,17))
fig.set_facecolor("#F3F3F3")
for i,j,k in itertools.zip_longest(lst,range(length),mods) :
    qx = plt.subplot(4,5,j+1)
    probabilities = i.predict_proba(test_X)
    # precision_recall_curve returns (precision, recall, thresholds), in that order
    precision,recall,thresholds = precision_recall_curve(test_Y,probabilities[:,1])
    # average precision is computed from the churn probabilities, not the hard labels
    plt.plot(recall,precision,linewidth = 1.5,
             label = ("avg_pcn : " +
                      str(np.around(average_precision_score(test_Y,probabilities[:,1]),3))))
    plt.plot([0,1],[0,0],linestyle = "dashed")
    plt.fill_between(recall,precision,alpha = .2)
    plt.legend(loc = "lower left",
               prop = {"size" : 10})
    qx.set_facecolor("k")
    plt.grid(True,alpha = .15)
    plt.title(k,color = "b")
    plt.xlabel("recall",fontsize =7)
    plt.ylabel("precision",fontsize =7)
    plt.xlim([0.25,1])
    plt.yticks(np.arange(0,1,.3))
With the Gradient Boosting algorithm we achieved above 93% accuracy, 86% recall, 88% precision, and an F1 score of 87%. This equates to correctly identifying 86% of customer churn cases, while about 12% of the customers flagged as likely churners are in fact loyal (precision of 88%). Assuming reasonable actions are taken, e.g. emailing the customer an offer, this model could be leveraged to improve customer satisfaction. Given the high imbalance of the data towards non-churners, accuracy alone can be misleading, so it makes sense to also compare the F1 score (87%) and precision (88%) and to pick the model with the best joint accuracy, precision, and F1 score. On that basis the leading models are Gradient Boosting and XGBoost, with recall scores of 86% and 87% respectively. XGBoost could be optimized further using Hyperband, BayesOpt, Optuna, or Ray Tune, and may then become the best model.
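To make the reading of the precision and recall figures quoted above concrete, the sketch below turns confusion-matrix counts into the two quantities a retention campaign cares about: the share of churners that are caught, and the share of offers that go to customers who would have stayed anyway. The counts are hypothetical, chosen only so that the ratios land near 86% recall and 88% precision; `campaign_summary` is an illustrative helper.

```python
def campaign_summary(tp, fp, fn):
    """Summarize a churn campaign from true positives, false positives
    and false negatives of the churn class."""
    flagged = tp + fp          # customers the model would target
    actual_churners = tp + fn  # customers who really churn
    return {
        "precision": tp / flagged,
        "recall": tp / actual_churners,
        "wasted_offer_share": fp / flagged,            # equals 1 - precision
        "missed_churner_share": fn / actual_churners,  # equals 1 - recall
    }


# Hypothetical counts: 100 real churners, 98 customers flagged.
summary = campaign_summary(tp=86, fp=12, fn=14)
for key, value in summary.items():
    print(f"{key}: {value:.3f}")
```

Under these assumed counts, roughly one offer in eight goes to a loyal customer, and roughly one churner in seven is missed.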
The XGBoost Classifier performed second best after hyperparameter tuning (a large improvement), with accuracy of 91%, recall of 87%, F1 score of 85%, and precision of 83%. This equates to correctly identifying 87% of customer churn cases, while about 17% of the customers flagged as likely churners are in fact loyal. Looking at model results, the best accuracy on the test set is achieved by the XGBoost Classifier algorithm with 91